
AN EXPLORATION OF THE FIRST PITCH IN BASEBALL

Ashley Spangler

A Thesis

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

May 2017

Committee:

James Albert, Advisor

Christopher Rump

John Chen

© 2017

Ashley Spangler

All Rights Reserved

ABSTRACT

James Albert, Advisor

Sabermetrics is the statistical analysis of baseball. This research began in the 1950s and has since become increasingly popular. Over the last couple of years, the availability of data within the sport of baseball has exploded. From three main sources, we have access to a vast range of statistics. This research investigates the importance of pitch count and the first pitch in baseball. The first pitch determines whether the hitter or the pitcher has the advantage in the at-bat and can set the tone for the rest of the at-bat.

Exploratory methods are used to investigate and summarize the relationships between variables through tables, contour plots, scatterplots, and line graphs. As a pitcher’s first pitch strike percentage increases, the number of innings pitched per game increases, Walks plus Hits per Inning Pitched (WHIP) decreases, decreases, and strikeout percentage increases. 64% of the first pitches thrown are four-seam fastballs, two-seam fastballs, or sliders, which are all fast pitches. Over 50% of the first pitches are in the strike zone.

Singles, doubles, triples, and home runs are more likely to occur on the first pitch. Pitchers have the highest pitching statistics when the hitter swings and misses on the first pitch, compared to a ball put in play, a called strike, or a ball. When the first pitch is a ball, hitters have the highest hitting statistics.

Generalized Additive Models (GAM) and Logistic Regression Models are used to discover the factors that are significant in predicting the probability that hitters swing. Logistic models were created for all pitches and then for first pitches for all players. Next, logistic models were created for four different players. In the majority of the models, count type (whether the count favored the pitcher, favored the hitter, or was neutral), the distance in feet of the pitch from the center of the strike zone, and whether runners were on base were significant in predicting the probability of swinging. Overall results suggest that hitters have different hitting strategies and that swinging on the first pitch, or in general, depends on the hitter. More mature hitters tend not to swing on the first pitch.

In honor of you, Dad.

ACKNOWLEDGMENTS

I would like to express my deepest appreciation to my advisor, Dr. James Albert, for his overwhelming support, assistance, and encouragement throughout the whole project. I would not have been able to get through all the struggles and frustrations without his patience and extensive baseball knowledge. I would also like to thank Dr. Christopher Rump and Dr. John Chen for serving on my committee.

Thanks to my fellow Bowling Green State University graduate students and professors. I would not have been able to get through graduate school without all of your support, advice, and knowledge.

Finally, I must express my very profound gratitude to my family, fiancé, and mom for providing me with unfailing support and continuous encouragement throughout my years of school and through the process of researching and writing this thesis. This accomplishment would not have been possible without them.

TABLE OF CONTENTS

SECTION 1: INTRODUCTION

1.1 Description of the Game of Baseball

1.2 Connection between Statistics and Baseball

1.3 Availability of Baseball Data

SECTION 2: IMPORTANCE OF PITCH COUNT AND THE FIRST PITCH

2.1 Previous Work

2.2 Different Philosophies for Hitters and Pitchers

SECTION 3: RESEARCH DESIGN

3.1 Research Questions

3.2 Methodological Design

SECTION 4: EXPLORATORY WORK OF THE FIRST PITCH

4.1 Importance of the First Pitch Example

4.2 Characteristics of the First Pitch

4.2.1 Percentage of First Pitch Strikes Thrown and Swing Rate of First Pitches

4.2.2 Pitch Type Classification

4.2.3 Pitch Type Thrown on the First Pitch

4.2.4 Percentage of Strikeouts and Walks for Plate Appearances Passing Through 0-1 vs 1-0 Count

4.3 Pitch Location by Stance and Count

4.4 Hit Quality

SECTION 5: FIRST PITCH-HITTER PERSPECTIVE

5.1 Batting Average

5.2 On-Base Percentage

5.3 Slugging Percentage

5.4 On-Base Plus Slugging

SECTION 6: FIRST PITCH-PITCHER PERSPECTIVE

6.1 Batting Average on Balls in Play

6.2 Left On-Base Percentage

6.3 Walks per Hits per Innings Pitched

6.4 Earned Runs Average

6.5 Fielding Independent Pitching

SECTION 7: SWING VS NO SWING

7.1 Logistic Regression Modeling

7.1.1 Probability of Swinging Given the Count

7.1.2 Probability of Swinging Given the Placement of the Runners

7.1.3 Probability of Swinging Given the Position of the Hitter in the Batting Lineup

7.1.4 Probability of Swinging Given the Inning

7.1.5 Logistic Model for All Players

7.1.6 Logistic Models for Certain Players

7.2 Generalized Additive Modeling

7.2.1 Probability of Swinging Based on the Location of the Pitch

7.2.2 Probability of Swinging Based on the Hitter

SECTION 8: CONCLUSION

REFERENCES

APPENDIX A: PITCH TYPE DEFINITIONS

APPENDIX B: DESCRIPTION OF PITCHF/X VARIABLES FOR PITCH CLASSIFICATION

APPENDIX C: ABBREVIATIONS USED

APPENDIX D: HELPFUL R FUNCTIONS AND LIBRARIES

LIST OF FIGURES

2.1 The First Pitch Effect

2.2 Number of Strikeouts per Game from 1893 to 2015

2.3 Swing Rate on the First Pitch by Season

4.1 Corey Kluber’s Location of First Pitches for the First Game of the 2016 World Series

4.2 Trevor Bauer’s Location of First Pitches for the Second Game of the 2016 World Series

4.3 Overall 2015 Seasonal Percentage of First Pitch Strikes Thrown vs WHIP for Kluber vs Bauer

4.4 Overall 2015 Seasonal Percentage of First Pitch Strikes Thrown vs Innings Pitched for Kluber vs Bauer

4.5 2015 Kluber First Pitch Strike Percentage by Month and Start Outcome

4.6 Overall 2015 Kluber Percentage of First Pitch Strikes vs Earned Runs, WHIP, and Innings Pitched

4.7 2015 Percentage of First Pitch Strikes vs Earned Runs, WHIP, and Innings Pitched for All Starting Pitchers

4.8 Percentage of First Pitch Strikes Thrown vs Walk Percentage

4.9 Percentage of First Pitch Strikes Thrown vs Strikeout Percentage

4.10 Percentage of Event by Pitch Number

4.11 Pitch Type Classification Example by Start Speed, Break Angle, and Break Length for Corey Kluber

4.12 Overall Pitch Location by Count for All Pitches

5.1 Mean BA by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

5.2 Mean OBP by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

5.3 Mean SLG% by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

5.4 Mean OPS by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

6.1 Mean BABIP by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

6.2 Mean LOB% by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

6.3 Mean WHIP by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

6.4 Mean ERA by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

6.5 Mean FIP by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

7.1 Probability of Swinging Given the Count and Count Type

7.2 Probability of Swinging for All Pitches and First Pitches Given the Placement of the Runners

7.3 Probability of Swinging for All Pitches and First Pitches Given the Position in the Batting Lineup

7.4 Probability of Swinging for All Pitches and First Pitches Given the Inning

7.5 Probability of Swinging by Distance from Strike Zone and Count Type for All Pitches

7.6 Probability of Victor Martínez Swinging Given that Runners Are Not On-Base for All Pitches

7.7 Probability of Victor Martínez Swinging Given that the Count Favors the Hitter for All Pitches

7.8 Probability of Salvador Pérez Swinging Given that Runners Are Not On-Base for All Pitches

7.9 Probability of Salvador Pérez Swinging Given that the Count Favors the Hitter for All Pitches

7.10 Probability of Adam Jones Swinging Given that Runners Are Not On-Base for All Pitches

7.11 Probability of Adam Jones Swinging Given that the Count Favors the Hitter for All Pitches

7.12 Probability of Swinging on All Pitches for Runners Not On-Base by Count Type

7.13 Probability of Swinging on First Pitches for Runners Not On-Base

7.14 Visual Representation of GAM

7.15 Probability of Swinging By Pitch Location

7.16 Probability of Swinging for All Pitches by Pitch Location for Mike Trout and Victor Martínez

7.17 Probability of Swinging for All Pitches by Pitch Location for Adam Jones and Salvador Pérez

LIST OF TABLES

2.1 Count, Mean, & Standard Deviation for Run Values of the First Pitch

4.1 First Pitch Type Percentage by Indians Pitcher in the 2016 World Series Games 1 and 2

4.2 Games 1 & 2 Comparison for Indians Pitchers Kluber and Bauer

4.3 Overall 2015 Seasonal Comparison for Kluber and Bauer

4.4 2015 Kluber Percentage of First Pitch Strikes Thrown by Start Outcome

4.5 Percentage of Event by Pitch Number

4.6 Pitch Type Classification Example by Start Speed, Break Angle, and Break Length for Corey Kluber

4.7 Percentage of Overall Pitch Type

4.8 Percentage of Pitch Type Thrown on First Pitches and Non-First Pitches

4.9 Percentage of Pitch Type Thrown by Count Type

4.10 Percentage Pitch Type Thrown on First Pitches and Non-First Pitches by Hand of Pitcher and Stance of Hitter

4.11 Percentage of Strikeouts and Walks for Plate Appearances Passing Through 0-1 vs 1-0 Count

4.12 Location of First Pitches and Non-First Pitches by Stance of Hitter

4.13 Location Percentage Based on Count and Count Type

4.14 Location Percentage Based on Count Type

4.15 Event Percentages for First and Second Pitches

4.16 Event Percentages by Count and Count Type

5.1 Mean BA and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

5.2 Mean OBP and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

5.3 Mean SLG% and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

5.4 Mean OPS and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

6.1 Mean BABIP and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

6.2 Mean LOB% and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

6.3 Mean WHIP and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

6.4 Mean ERA and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

6.5 Mean FIP and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

6.6 Summary of Rankings for Mean Hitting and Pitching Statistics Overall

6.7 Summary of Rankings for Mean Hitting and Pitching Statistics for First Pitches

7.1 Coefficient Estimates for Full Model for All Players and All Pitches

7.2 Coefficient Estimates for Reduced Model for All Players and All Pitches

7.3 Coefficient Estimates and Odds Ratio for Reduced Model for All Players and All Pitches

7.4 Coefficient Estimates for Full Model for All Players and First Pitches

7.5 Coefficient Estimates and Odds Ratio for Reduced Model for All Players and First Pitches

7.6 Mike Trout’s Coefficient Estimates for Full Model for All Pitches

7.7 Mike Trout’s Coefficient Estimates for Reduced Model for All Pitches

7.8 Coefficient Estimates and Odds Ratio for Mike Trout’s Reduced Model for All Pitches

7.9 Coefficient Estimates and Odds Ratio for Mike Trout’s Reduced Model for First Pitches

7.10 Coefficient Estimates and Odds Ratio for Victor Martínez’s Reduced Model for All Pitches

7.11 Coefficient Estimates and Odds Ratio for Victor Martínez’s Reduced Model for First Pitches

7.12 Coefficient Estimates and Odds Ratio for Salvador Pérez’s Reduced Model for All Pitches

7.13 Coefficient Estimates and Odds Ratio for Salvador Pérez’s Reduced Model for First Pitches

7.14 Coefficient Estimates and Odds Ratio for Adam Jones’s Reduced Model for All Pitches

7.15 Coefficient Estimates and Odds Ratio for Adam Jones’s Reduced Model for First Pitches

7.16 Odds Ratio Comparison for All Four Players for All Pitches

7.17 Odds Ratio Comparison for All Four Players for First Pitches


SECTION 1: INTRODUCTION

1.1 Description of the Game of Baseball

Baseball. A word that many Americans are familiar with. A word that brings joy to many people as they watch or perhaps play this sport. It unites Americans. It is named “America's favorite pastime” for a reason.

The game of baseball has two teams, each with nine players. The nine players consist of a pitcher (who throws the ball to the hitters), a catcher (who catches the ball from the pitcher), a 1st baseman, a 2nd baseman, a shortstop, a 3rd baseman, a left-fielder, a center-fielder, and a right-fielder. The game is divided into innings. In the top half of an inning, one team is in the field while the other team hits, and the teams change roles for the bottom half of the inning. The team that is hitting tries to score as many runs as it can by having its players take turns hitting the ball thrown by the pitcher of the fielding team. While the hitting team is trying to score runs, the fielding team is trying to stop runs from being scored. A run is scored when a player runs counterclockwise from first base to second base, to third base, and finally crosses home plate. The hitter can stop at any of the bases after they hit and wait for another hitter to hit before advancing to the next base. An inning is complete when both teams have had a chance to hit. Nine innings are played in each game, unless the score is tied after the ninth inning. In this case, extra innings are played until one of the teams has outscored the other.

A plate appearance is the event when a hitter comes up to hit. The hitter starts with a 0-0 count, which means that there are no balls and no strikes. Formally, plate appearances are the sum of the hitter’s at-bats, walks, hit by pitches, and sacrifice flies. An at-bat is counted when the hitter makes an out or gets on-base through a hit, fielder’s choice, or an error. At-bats exclude walks, hit by pitches, sacrifice hits, and plate appearances in which the hitter is awarded first base due to catcher interference. A fielder’s choice occurs when the fielding team decides to try to get a different runner out, which allows the hitter to reach first base safely. The pitcher throws a pitch to the hitter and the count either goes to 1-0 (one ball and zero strikes), 0-1 (zero balls and one strike), or the ball is hit “in-play” by the hitter. If four balls are thrown to the hitter, then the hitter is awarded first base. The hitter is out if there are three strikes. A ball occurs when the pitch crosses home plate outside the hitter’s strike zone and the hitter does not swing. The strike zone is the width of home plate and extends from the hitter’s knees up to the hitter’s chest. A strike can occur in the following situations: the ball is thrown within the described strike zone and the hitter doesn’t swing (a called strike), the hitter swings at a pitch in any location and doesn’t make contact with the ball (a swinging strike), or the hitter hits the ball but it is called “foul” and there are not already two strikes on the hitter. There are lines drawn from home plate through first base and out to the outfield, and also from home plate through third base and out to the outfield. These lines are called foul lines. Balls hit to the right of the first base foul line or to the left of the third base foul line are called foul. Called balls and strikes are determined by the home plate umpire.
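The count rules described above can be sketched as a small state update. This is a minimal Python illustration, not code from the thesis; the function name `update_count` and the pitch labels are assumptions introduced here, and special cases such as a hit by pitch are ignored.

```python
def update_count(balls, strikes, pitch):
    """Return (balls, strikes, status) after one pitch.

    `pitch` is one of the hypothetical labels 'ball', 'called_strike',
    'swinging_strike', 'foul', or 'in_play'.
    """
    if pitch == "in_play":
        # The ball was hit into play, so the count no longer changes.
        return balls, strikes, "in play"
    if pitch == "ball":
        balls += 1
    elif pitch in ("called_strike", "swinging_strike"):
        strikes += 1
    elif pitch == "foul":
        # A foul counts as a strike only when there are fewer than two strikes.
        if strikes < 2:
            strikes += 1
    else:
        raise ValueError(f"unknown pitch label: {pitch}")

    if balls == 4:
        return balls, strikes, "walk"
    if strikes == 3:
        return balls, strikes, "strikeout"
    return balls, strikes, "continue"


# Example: a foul ball on a 3-2 count leaves the count unchanged.
print(update_count(3, 2, "foul"))
```

Starting every plate appearance from `update_count(0, 0, ...)` reproduces the three first-pitch outcomes named in the text: 1-0, 0-1, or a ball in play.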

Essentially, baseball is a confrontation between the hitting and fielding teams, and, more specifically, it is a battle between the hitter and the pitcher. The hitter is trying to put the ball in play to advance runners already on base or to get on base himself. The pitcher’s goal is to get the hitter out, either through three strikes or by a play in the field where the fielding team gets an out. Examples of outs in the field would be a pop-up or fly ball that is caught, or a groundball that is fielded and thrown to first base before the runner reaches the base.


1.2 Connection between Statistics and Baseball

Statistics. Also, a word that many Americans know. This word, however, instills a different emotion. Many people cringe at the sheer sight or thought of statistics, almost like it is a swear word or a secret word that should never be spoken aloud. However, when considering the word “counting”, there is less of a stigma. People count sheep to help themselves fall asleep. People count the number of minutes before leaving work, the number of toes a child has when they are born, the number of days or weeks before Christmas, the number of steps completed in a day, and the list continues. Americans are used to counting; it’s almost second nature. Statistics is like counting on steroids.

Surprisingly, few people understand that baseball and statistics are so closely connected. Ever since the creation of the sport in the 18th century, people have been trying to determine the best teams and players. People have accomplished this task by looking at several different statistics and comparing teams and players on them to determine who is the better team or player. In his book, The Numbers Game, Alan Schwarz states, “When you get right down to it, no corner of American culture is more precisely counted, more passionately quantified, than the performance of baseball players” (Schwarz xiii). There are numerous count statistics measured within the game of baseball, such as the number of runs scored in a game, the number of pitches thrown, the number of baserunners, and the number of innings pitched. Anything that could possibly be counted or measured probably is somehow. This intense study of the game and players has created a new culture known as Sabermetrics, the statistical analysis of baseball.

Sabermetricians do more than just count events in baseball. They devise statistics to measure the performances of players. “Sabermetric researchers often use statistical analysis to question traditional measures of baseball evaluation such as batting average and pitcher wins” (Birnbaum). For hitting, the statistic normally used to compare hitters is batting average, which is the number of hits divided by the number of at-bats. This statistic measures how likely the hitter is to get a hit out of all their opportunities to hit. Another hitting statistic is slugging average, which is the total number of bases reached divided by the total number of at-bats. Each hit is weighted by the number of bases reached, so the total number of bases equals

$$\text{TOTAL BASES} = \text{1B} + 2 \times \text{2B} + 3 \times \text{3B} + 4 \times \text{HR}$$

where 1B, 2B, 3B, and HR represent the total number of times the hitter hits a single, double, triple, and home run, respectively, in a game. For pitching, the most common statistic used to compare pitchers is Earned Run Average (ERA), which is the average number of earned runs given up per nine innings. Pitchers are also compared on the number of games they won, number of strikeouts, and number of walks (Albert "Sabermetrics"). These statistics are useful, but Sabermetricians work hard to find better statistical measures to compare players. Bill James was a baseball writer and statistician in the 1970s and his works on baseball and statistics started the field of Sabermetrics. “James argues that a hitter should be evaluated by his ability to create runs for his team. From an empirical study of a large collection of team hitting data, he established the following formula for predicting the number of runs scored in a season based on the number of hits, walks, at-bats, and total bases recorded in a season” (Albert "Sabermetrics"). This formula is:

$$\text{RUNS} = \frac{(\text{HITS} + \text{WALKS}) \times (\text{TOTAL BASES})}{(\text{AT-BATS}) + \text{WALKS}}$$

This formula takes into consideration the ability to get runners on base and the ability to advance them.
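The total bases, slugging average, and runs formulas described above can be checked with a few lines of Python. This is a minimal sketch, not code from the thesis; the function names and the season line used in the example are hypothetical and chosen only for illustration.

```python
def total_bases(singles, doubles, triples, home_runs):
    """Weight each hit by the number of bases reached."""
    return singles + 2 * doubles + 3 * triples + 4 * home_runs


def slugging_average(bases, at_bats):
    """Total bases divided by at-bats."""
    return bases / at_bats


def runs_created(hits, walks, bases, at_bats):
    """James's formula: (HITS + WALKS) * TOTAL BASES / (AT-BATS + WALKS)."""
    return (hits + walks) * bases / (at_bats + walks)


# Hypothetical season line, for illustration only.
singles, doubles, triples, home_runs = 100, 30, 5, 20
at_bats, walks = 550, 60
hits = singles + doubles + triples + home_runs
bases = total_bases(singles, doubles, triples, home_runs)

print("Total bases:", bases)
print("Slugging average:", round(slugging_average(bases, at_bats), 3))
print("Runs created:", round(runs_created(hits, walks, bases, at_bats), 1))
```

The numerator rewards reaching base (hits plus walks) and advancing (total bases), while the denominator scales by opportunities, which matches the two abilities the formula is meant to capture.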

Another hitting statistic, created by George Lindsey, assigns “run values to each event that could occur while a team was batting” (Albert "Sabermetrics"). His formula uses linear weights for each of the ways a hitter can get a hit. These linear weights take into account the advantage of getting runners on, advancing runners, and avoiding outs.

$$\text{RUNS} = 0.41 \times \text{1B} + 0.82 \times \text{2B} + 1.06 \times \text{3B} + 1.42 \times \text{HR}$$
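As a rough illustration of how the linear weights above are applied, the following Python sketch evaluates the formula for a hypothetical hitter. The function name and the input counts are assumptions for this example only; the weights are the ones quoted in the formula.

```python
# Linear weights for each type of hit, as given in the formula above.
LINDSEY_WEIGHTS = {"1B": 0.41, "2B": 0.82, "3B": 1.06, "HR": 1.42}


def lindsey_runs(singles, doubles, triples, home_runs):
    """Weighted sum of hit counts using Lindsey's linear weights."""
    return (LINDSEY_WEIGHTS["1B"] * singles
            + LINDSEY_WEIGHTS["2B"] * doubles
            + LINDSEY_WEIGHTS["3B"] * triples
            + LINDSEY_WEIGHTS["HR"] * home_runs)


# Hypothetical hit totals, for illustration only.
print(round(lindsey_runs(100, 30, 5, 20), 1))
```

Note that a double is weighted exactly twice a single, while a home run is weighted less than four singles, so the weights are not simply proportional to bases reached.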

Thorn and Palmer created a statistic to measure a pitcher’s ability better than ERA does. According to Albert, “The ERA does measure the rate of a pitcher's efficiency, but it does not tell you about the actual benefit of this pitcher over an entire season” (Albert "Sabermetrics"). Thorn and Palmer developed a pitching runs formula to account for the entire season.

$$\text{PITCHING RUNS} = \text{INNINGS PITCHED} \times \frac{\text{LEAGUE ERA}}{9} - \text{EARNED RUNS}$$

The product of innings pitched and league ERA/9 gives the number of earned runs an average pitcher would be expected to allow over the same number of innings. Subtracting the pitcher’s actual number of earned runs from this expected number shows how the pitcher performed relative to the league average over the season. If the pitching runs value is greater than zero, then the pitcher did better than the average pitcher.
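The pitching runs calculation above is simple enough to sketch directly. The Python below is an illustration under assumed inputs; the function name and the example values for innings pitched, league ERA, and earned runs are hypothetical.

```python
def pitching_runs(innings_pitched, league_era, earned_runs):
    """Expected earned runs at the league rate minus actual earned runs."""
    expected_runs = innings_pitched * league_era / 9
    return expected_runs - earned_runs


# Hypothetical pitcher: 200 innings, 70 earned runs, league ERA of 4.00.
value = pitching_runs(200, 4.00, 70)
print(round(value, 2), "better than average" if value > 0 else "not better than average")
```

A pitcher whose own ERA equals the league ERA gets a value of zero, which is why zero is the dividing line between above-average and below-average seasons.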

These statistics are just a few examples of how Sabermetricians work to create statistics that consider multiple factors to help better assess a player’s ability. These statistics, along with many others, are used to answer loads of questions about the game of baseball and the players.

1.3 Availability of Baseball Data

Over the last couple of years, the availability of data within the sport of baseball has exploded. Three main sources, the Lahman database, Retrosheet data, and PITCHf/x data, provide access to a vast range of statistics “from seasonal stats available since the 1871 season, to data for individual games, to play-by-play accounts covering most games since 1945, to extremely detailed pitch-by-pitch data recorded for nearly all the pitches thrown in MLB parks since 2008” (Marchi and Albert 2).

The Lahman database gives pitching, fielding, and hitting statistics from 1871-2015. Sean Lahman, creator of the Lahman Baseball Database, made his database available to the public in 1995, which created a surge of baseball research. “His Baseball Archive web site was one of the earliest sources for baseball information on the Internet, and he headed the first significant effort to make a database of baseball statistics freely available to the general public” ("About"). Within the database, player information such as the day, month, and year of birth and death can be obtained. Information on the city and state of birth and death, along with the player’s weight, height, salaries, and awards, is also within the database. The database also includes all-star, hall of fame, information, and post season data. The most commonly used data from the Lahman database are the seasonal pitching and hitting statistics for all players.

The Retrosheet database provides detailed play-by-play information for games since 1945. This database was actually started in the 1980s by individual fans. These fans created an organization called Project Scoresheet, which had the goal of making baseball data freely available to the public. These volunteers redefined the methods of score keeping and created a new scoresheet that could easily be entered into a computer. The major changes made to the traditional scorecard were how innings were recorded, how the substitution of players was recorded, and how events were recorded. The organization, Retrosheet, was then created in 1989 to computerize the play-by-play data. “For each play, information is reported on the situation (inning, team batting, number of outs, presence of runners on-base), the players on the field, the sequence of pitches thrown, and details on the play itself” (Marchi and Albert 18). Therefore, Retrosheet can provide a box score or a statistical summary of everything that happened in the game. Game logs since 1871 can also be obtained, which give information on who the umpires were for the game, what park the game was held at, the number of people in attendance, what day of the week the game was played on, the name and position of the starting players, and statistics for the home and visitor teams. Not only does Retrosheet provide box score data, but also play-by-play data that contains information about each plate appearance during a game. One can also find game logs and schedules of games through Retrosheet ("Site").

Lastly, the PITCHf/x data contain exhaustive pitch-by-pitch information, with coverage beginning in 2006. “Detailed tracking data were recorded for about a third of the pitches thrown in the major leagues in 2007 and more than 95 percent of the pitches in 2008 and 2009. Sportvision made these data available in real-time to Major League Baseball Advanced Media (MLBAM) for use in its online Gameday application and to broadcasters such as ESPN for its K-Zone strike zone graphic” (Fast, “PITCHf/x”). This technology utilizes three cameras placed in the stadium that take about 20 pictures of each pitch along its route from the pitcher’s hand to the catcher to capture the trajectory of the pitch (Fast, “PITCHf/x”). Thus, if you have seen a baseball game anytime within the last nine years, then you have probably seen the box that appears after some pitches are thrown during a TV broadcast. This box is used to determine the pitch location relative to the strike zone. From the technology, users can know the pitch speed, release point, beginning and ending speed of the pitch, location of the pitch, type of pitch, the count the pitch was thrown on, the break angle, whether the pitcher and hitter were right-handed or left-handed, and the outcome of the pitch (ball put in play, called strike, swinging strike, ball, etc.).


SECTION 2: IMPORTANCE OF PITCH COUNT AND THE FIRST PITCH

This study aims to look at the importance of the first pitch and whether it benefits the pitcher or the hitter when the hitter swings on the first pitch. This study also explores the importance of count. Lastly, it looks at some of the circumstances that might affect whether a hitter swings. These might include the inning, the count, the position in the batting lineup, or the location of the pitch.

The pitch count is the number of pitches a pitcher has thrown. The pitch count has become increasingly important to the world of baseball, not just in . New laws have now regulated the maximum number of pitches a young pitcher can throw during the game and then how many days the pitcher must rest depending on the pitch count. Not only is the number of pitches important for a young baseball pitcher, but it is also important for the major league pitchers.

Keeping the pitch count low decreases the risk for injury for the pitchers. “You don't want guys throwing tired -- their mechanics break down, and the unnatural motion of pitching becomes more stressful on the fulcrum, the elbow” (Kurtenbach). In order to keep the pitch count down, it is important for pitchers to make the most out of every pitch they throw. This leads to the count on the hitter being significant to the pitchers. The count is the number of balls and strikes on the hitter.

If the hitter reaches four balls, then they are awarded first base. If the hitter accumulates three strikes, then the hitter is out. There are twelve possible counts, with 0-0 being the count when each hitter steps into the batter’s box. The possible counts are 0-0, 0-1, 0-2, 1-0, 1-1, 1-2, 2-0, 2-1, 2-2, 3-0, 3-1, and 3-2.

The count can help determine whether the pitcher or the hitter has the advantage. For example, as proposed by analyst Tom Tango and observed by Marchi and Albert, a pitcher’s count is a count in which the pitcher has the advantage during the at-bat, and it occurs when the count is 0-2, 1-2, 2-2, or 0-1. A hitter’s count is a count that favors the hitter and occurs when there are at least two more balls than strikes, which is 3-0, 3-1, or 2-0. When the count is 3-2, 2-1, or 1-0, the count is considered to be a modest hitter’s count because the hitter is the same distance away from four balls as from three strikes (Tango). Lastly, neutral counts consist of the counts 0-0 and 1-1. With these definitions, starting from a 0-0 count can lead to either a hitter’s count or a pitcher’s count. Therefore, the first pitch is important in setting the pace of the at-bat.
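The count types defined above can be written as a simple lookup. The Python below is a minimal sketch; the dictionary and function names are assumptions introduced for illustration, while the grouping of counts follows the definitions in the text.

```python
# Counts are (balls, strikes) pairs grouped as described in the text.
COUNT_TYPES = {
    "pitcher": {(0, 1), (0, 2), (1, 2), (2, 2)},
    "hitter": {(2, 0), (3, 0), (3, 1)},
    "modest hitter": {(1, 0), (2, 1), (3, 2)},
    "neutral": {(0, 0), (1, 1)},
}


def count_type(balls, strikes):
    """Return the count type label for a (balls, strikes) count."""
    for label, counts in COUNT_TYPES.items():
        if (balls, strikes) in counts:
            return label
    raise ValueError(f"not a valid count: {balls}-{strikes}")


# The first pitch moves a neutral 0-0 count to one of these two counts.
print(count_type(0, 1), "|", count_type(1, 0))
```

The four groups together cover all twelve possible counts, so every pitch in a plate appearance can be labeled with exactly one count type.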

Each event in baseball has an associated run value that can be calculated. Run value is defined as the average number of runs scored in the remainder of the inning based on the scenario. For example, a game situation where there is one out and runners on first and second has a different run value than a situation with two outs and no one on-base. The count on a hitter works the same way. Each combination of strikes and balls leads to a different run value. Having two balls might lead to a higher chance of runs scoring during the remainder of the inning than having two strikes.

In his blog Exploring Baseball Data with R, Jim Albert explored the first pitch effect (Albert "The First Pitch Effect"). Using Retrosheet data from 2013 and a function to compute the run values for all plays, Albert extracted the first pitch from the pitch sequence variable “pseq” in the Retrosheet play-by-play database. The “pseq” variable gives detailed information on what happened during the plate appearance. For example, a called strike or foul ball can be determined. Retrosheet can also show if the pitcher throws to a base to try to pick off the runner or if the runner tries to steal during an at-bat. Then, based on the first letter of “pseq,” he classified the first pitch as a ball, a strike, or the end of the plate appearance (when the ball was put into play on the first pitch). With a 0-0 count on the hitter, the run value is essentially 0. He then explored the run values associated with the count going to 0-1, 1-0, and end of plate appearance and calculated the mean and standard deviation for each of the three options for the first pitch. Is there any significant difference in the run values of the first pitch for 2015 compared to 2013? The means and standard deviations of the 2015 run values are extremely similar to those for 2013. From Table 2.1, the run values of counts of 1-0 and 0-1 are respectively 0.0339 and -0.0387, and so the hitter’s benefit of getting a ball on the first pitch compared to getting a strike on the first pitch is approximately 0.0339-(-0.0387)=0.0726 runs. From the pitcher’s point of view, the run value decreases by 0.0726 runs when the first pitch is a strike compared to a ball.

Table 2.1 Count, Mean, & Standard Deviation for Run Values of the First Pitch

Count                     N        Mean          Standard Deviation
0-1                       90,917   -0.03871049   0.4481899
1-0                       71,710    0.03386158   0.4884431
End of Plate Appearance   21,001    0.04502319   0.5479119

Table 2.1 shows that there is a hitter advantage of about 0.0726 runs when the first pitch is a ball compared to a first pitch strike. Equivalently, from the pitcher’s point of view, the run value is lower when the first pitch is a strike than when it is a ball.
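As a quick check of the arithmetic, the first pitch effect can be recomputed from the values in Table 2.1. The standard error at the end is an addition made here, not part of the original analysis; it treats the two groups as independent samples, which is an assumption.

```python
import math

# Values copied from Table 2.1: (N, mean run value, standard deviation).
table_2_1 = {
    "0-1": (90917, -0.03871049, 0.4481899),
    "1-0": (71710, 0.03386158, 0.4884431),
}

n_strike, mean_strike, sd_strike = table_2_1["0-1"]
n_ball, mean_ball, sd_ball = table_2_1["1-0"]

# Hitter's gain from a first pitch ball relative to a first pitch strike.
first_pitch_effect = mean_ball - mean_strike
print(round(first_pitch_effect, 4))  # 0.0726

# Rough standard error of that difference (assumes independent samples).
se_diff = math.sqrt(sd_strike**2 / n_strike + sd_ball**2 / n_ball)
print(se_diff)
```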

Figure 2.1 The First Pitch Effect

Figure 2.1 displays the mean run values for first pitch strike, first pitch ball, and end of plate appearance. The red dot represents the mean run value and the bars are plus and minus one standard deviation.


2.1 Previous Work

The strikeout rates are on the rise and setting records since 2008 with 2015 being the largest strikeout rate in the . “Major league baseball fans in 2014 witnessed something historic: A hitter was more likely to strike out this year than in any other season in the long history of the sport, at a rate of 7.70 times per team per game” (Treder). Over the years, the strikeout rates have risen and declined. These changes are mostly due to changes in rules, strategies and playing conditions. “Over the past 100+ years, strikeout rates have steadily climbed, with periodic dips here and there” (Treder).

Figure 2.2 Number of Strikeouts per Game from 1893 to 2015

Figure 2.2 displays the number of strikeouts per game from 1893 to 2015. Strikeouts per game have been increasing throughout time and have reached the highest number of strikeouts in 2015.

A quick review of the changes in baseball can help to understand the hills and valleys of the strikeout rates shown in Figure 2.2. In the 1900’s, the foul strike rule was adopted. This rule stated that a foul ball counted as a strike on the hitter unless there were already two strikes on the hitter. Also, the pitching distance was moved back to sixty feet and six inches, previously forty-five, fifty, and then fifty-seven feet. The strikeout rates remained constant from about 1905 to 1918. From 1918 through 1919, there was a large decrease in the average number of strikeouts per game. There is no big rule change or real explanation for this decrease. “Perhaps in some mysterious way it had to do with the shortened season schedules caused by World War I” (Treder).

Strikeout rates bottomed out at this time and then began to rise slightly. From 1946 to 1963 there was a large increase in the average number of strikeouts per game. “The 1947 season is often described as the beginning of modern major league baseball, because racial integration began then” (Treder). The rule makers in 1969 decided to make the strike zone smaller. This change in the strike zone led to a decrease in the number of strikeouts per game. Umpires in the 1980’s were not sticking to the strike zone that was in the rule book. “It was commonly observed that hitters were achieving their newly shaped success with the apparently inadvertent assistance of umpires, who appeared ever-less willing to call the strike zone as defined in the rule book. Nearly nothing above the belt was a strike” (Treder). In 1988, the MLB threatened to replace the umpires if they didn’t enforce the strike zone as defined in the rule book. The 2001 season led to a small peak in the strikeout rates, followed by a slight decrease in the following years. Since 2006, the strikeout rates have increased and continue to soar higher. “The factors favoring growth in strikeouts include a willingness of hitters to swing for power at the expense of contact, but are mostly driven by the willingness— and, most significantly, by the ability—of pitchers to pitch for strikeouts at the expense of in-game endurance, as they are deployed in ever-shorter stints” (Treder).

Since strikeout rates are increasing, are swinging rates on the first pitch also increasing?

Within the last couple of years, swinging on the first pitch is indeed on the rise. Comparing 2014 to 2015 for all 30 teams in baseball, “two-thirds of the teams have increased their rates of swings at the first pitch” (Sullivan). Along with the increase in the rate of swings at the first pitch, 2015 was the first year that the performance of hitters who swung at the first pitch was not worse than that of hitters who took the first pitch. Sullivan provides a great explanation for the increase in swinging on the first pitch:

To simplify everything: Hitters are behaving more aggressively. They had been more and more willing to let pitchers get ahead 0-and-1, but this year hitters have fought back, as if they finally noticed the trend. Pitchers are throwing better stuff than ever. They’re getting more strikeouts than ever. Pitchers these days, when ahead, are incredibly difficult to hit. So hitters have increasingly tried to get to them early, before they get in front…Hitters should go up there ready to hit, because the first pitch might well be the best pitch to hit that they see. (Sullivan)

Many players and coaches suggest being patient and selective in the batter’s box. Dave Cameron, author of an article called “Has the first-pitch take become a losing strategy?”, argues that being patient shouldn’t mean laying off a perfect first pitch. “Being selective shouldn't be equated with standing there watching a centered, elevated get called for strike one, but maybe major-league hitters have indeed become a little too willing to take a good first pitch, only to strike out before ever seeing another meatball again” (Cameron).

From Figure 2.3, the swing rates decreased for the seasons 1988 to about 1996. From 1996 to 2001 there is an increase in the percentage of first pitches swung on. Then again there is a decrease in swing rates from 2001 to about 2010. Just like the percentage of strikeouts, the swing rates have steadily increased since about 2010.


Figure 2.3 Swing Rate on the First Pitch by Season

Figure 2.3 displays the swing rate on the first pitch from 1988 through 2015. Although strikeout rates are increasing through the years, swing rates have been decreasing until about 2010. Since 2010, swing rates on the first pitch are increasing.

James Gentile from the Hardball Times wrote an article called "Swinging or Taking on the First Pitch" where he looked at the weighted on-base average (wOBA) of all plate appearances from 1988-2012. He concludes that a first pitch ball leads to the highest wOBA of plate appearances. Looking at first pitch strikes, swinging at the first pitch results in a higher wOBA than taking a first pitch called strike (Gentile).

From Jim Albert’s article “Using the Count to Measure Pitching Performance,” published in the Journal of Quantitative Analysis in Sports, the proportion of home runs hit is higher when the count is 1-0 than when it is 0-1. The proportion of groundouts is higher for a 0-1 count than for a 1-0 count. When looking at in-play hit proportions, the proportion increases when going from 0-0 to 1-0 and decreases when going from 0-0 to 0-1 (Albert “Using the Count”).

This again reiterates the importance of the pitcher getting ahead in the count. Also, for the twenty pitchers chosen in the 2009 season, pitchers were throwing first pitch strikes the majority of the time when compared to first pitch balls and end of plate appearances. End of plate appearances can occur when the hitter is hit by a pitch, the hitter puts the ball in play, the hitter strikes out, or the hitter walks. For these pitchers, “there were 8628 transitions from a 0-0 to a 0-1 count, 6705 transitions from a 0-0 to a 1-0 count, and 2135 transitions from the 0-0 count to an EPA event” (Albert “Using the Count”). These transitions show that the pitchers are throwing more first pitch strikes and/or the hitter is swinging more on first pitches than the hitter taking a called ball or putting the ball into play. Lastly, looking at the balls and strikes effect proposed in the book Analyzing Baseball Data with R, Marchi and Albert estimated that the difference in runs between a first pitch strike and a first pitch ball is over 0.07 (Marchi and Albert 169).

2.2 Different Philosophies for Hitters and Pitchers

As children grow up in the world of sports, they are usually taught certain techniques, strategies, and how to overcome obstacles at a young age. These philosophies may change slightly from coach to coach or across different sports. They may even change as the child matures and has more experiences to shape their own philosophies. Overall, players and coaches usually have a certain view of how to play the sport. For example, in baseball, many coaches prefer that hitters don’t swing if the count is three balls and no strikes. Other coaches give the players the green light to swing on this count. Therefore, there is a coaching aspect that might affect how pitchers pitch and how hitters hit.

Imagine stepping up to the plate for the first time, palms sweating, hands shaking and a slight fear in your eyes. You take the bat off your shoulder and get in your batting stance. The pitcher starts their windup and the pitch is delivered. There are usually two outcomes from this pitch if you don’t swing: the pitch is a ball or a strike. Either one of these options can change your perception of the at-bat. If the pitch is a strike, then this might make you even more nervous. Your confidence could take a hit and you might become a little more anxious. If the pitch is a ball, then this might allow you to take a deep breath, gain some confidence, and be ready to swing at the next pitch.

Coaches want to win games and to do this they need a strong offense to score runs. Jim Schwanke wrote an article in 2009 that appeared in the Collegiate Baseball Newspaper titled “Hitter Discipline Can Beat Elite Pitchers”. Schwanke states, “If you are going to create disciplined team offense to win championships, extending at-bats and increasing pitch counts are the keys” (Schwanke). Schwanke challenges hitters to be patient at the plate. Many coaches believe in a similar strategy of players being disciplined at the plate. “Almost all players that hit for a high average show great plate discipline by not swinging at bad pitches. On the other hand, hitters with low batting averages are often swinging at bad pitches” (Edlin, “Hitting”). When hitters first approach the plate, the strike zone in their eyes might change depending on the number of balls and strikes. For example, if the count is 0-0, then the hitter might shrink the strike zone and only swing at pitches that are within this new zone. If there are two strikes on the hitter, then the hitter might make the strike zone larger and swing at pitches that are anywhere close to the strike zone.

The opposite is true for balls. The more balls that are on a hitter, the more likely they are to have a more selective strike zone. For example, with two balls and no strikes, the hitter is looking for a pitch that is in their favorite hitting zone. They don’t want to swing at a pitch that they don’t like when the count is in their favor (Edlin, “Hitting”).

Now, let’s turn to the pitcher’s point of view. Imagine again that you are taking the pitcher’s mound for the first time. The sun is beating down on you and your palms are sweaty. The catcher may feel miles away and you aren’t confident that you can get the ball to the catcher. The hitter steps up to the plate. The catcher gives you the sign for the pitch. You must start your windup. Your first pitch is a ball. Your confidence might drop slightly and you may be less likely to attack the hitter. What if your first pitch is a strike instead? This increases your confidence. The count is now in your favor and you have control over which pitch you can throw next.

While hitters are coached to be patient at the plate, pitchers are coached to go after the hitter. The pitchers want control over the count and the hitter. If pitchers can get the count in their favor, then they have more options for pitch types and locations. If the pitcher is behind in the count, then they might be more careful about the type of pitches they throw and the location of the pitches. Pitchers must mix up what they throw in certain counts. If pitchers get behind 1-0, the pitcher can’t always throw a fastball. The hitters would catch on to this pattern and look for a fastball to hit. Not only do pitchers have to mix up the type of pitch, but they also need to change the location of the pitch. This location change will force the hitter to be ready for low inside, high outside, or anywhere in between. “Pitching smart is all about constantly making the hitter change his bat speed and eye level” (Ellis).

Phil Tognetti, a former collegiate and professional pitcher, states in his blog called The Full Windup, “Establishing the count in the pitcher’s favor gives him a huge advantage. Yes, hitters will still get base hits, but staying ahead in the count makes it significantly more difficult for the hitter to do so. The pitcher must make a commitment to throw strikes early in the count, trust his stuff, and know the statistics are on his side if he throws strike one” (Tognetti). If pitchers can’t throw strikes within the first couple of pitches, then the hitters are in a really good spot. The hitters have control and can be very selective of the type of pitches they swing at. “Hitters start drooling when they step in the box with a 2-0, 3-0, or 3-1 count. They know you are limited with what you can do and know you have to throw a strike. They will often shrink their zone and if you throw it in that zone, look out. Your pitching strategy should always be to get ahead of the hitters” (Edlin, “Pitching”). Steven Ellis, a two-time MLB Draft pick and former pitcher, created a website to help provide pitching programs to parents and coaches of youth baseball pitchers. In one of his articles called “39 Youth Pitching Strategies To Keep Hitters Guessing,” Ellis talks about the pitching strategy and the importance of pitchers getting ahead. Ellis states, “When a hitter has his back against the wall, he has to expand his zone and swing at pitches he might not normally swing at to himself from striking out. This takes him out of his comfort zone and game plan for his time” (Ellis).

These websites and blogs of different pitching and hitting strategies are great examples that children are taught at a young age the best ways to be a good pitcher or hitter. These philosophies, along with others, continue to shape the players as they grow up and mature.


SECTION 3: RESEARCH DESIGN

3.1 Research Questions

The overall questions are: What is the importance of the first pitch and is it beneficial to swing on the first pitch? To answer these questions, there are some underlying questions that might help get additional information to make a decision. These underlying questions include:

o What is the percentage of first pitch strikes thrown?

o What type of pitch is normally thrown on the first pitch?

o What is the swing rate for first pitches? How does this compare to the other counts?

o Looking at hitting statistics (i.e. OBP, SLG%, BA and OPS), is there an advantage to the hitter to swing on the first pitch instead of taking a first pitch strike or ball?

o Looking at pitching statistics (i.e. BABIP, FIP, LOB%, WHIP, and ERA), is there an advantage to the pitcher if the hitter swings on the first pitch vs taking a first pitch strike or ball?

o Does the location of the pitch affect the probability the hitter will swing?

o Does the placement of runners affect the probability the hitter will swing?

o Does the position of the hitter in the batting lineup affect the probability the hitter will swing?

o Does the count affect the probability the hitter will swing?

o How does pitchers’ first pitch strike percentage affect their strikeout or walk percentages?

o Strikeout rates have been steadily increasing over the seasons. Is this also true for first pitch swing rates?

o What factors are significant in predicting the probability a hitter will swing for all pitches and first pitches?

3.2 Methodological Design

Overall, the design plan is to use exploratory methods to explore and summarize the relationships between certain variables. The main graphical techniques that will be used are contour plots and scatterplots. Each of these methods is used in a way to help uncover the underlying structure of the data. The mean of various pitching and hitting measures is used to determine if there is any difference between a first pitch ball, a first pitch strike, and swinging at the first pitch.

In the book Analyzing Baseball Data with R, Marchi and Albert look at the pitch selection by count of for five seasons. With this information, they can see what type of pitch (four-seam fastball, two-seam fastball, curve ball, change-up, or a ) he is likely to throw depending on the count. This information/technique can then be applied to all pitchers in a season to see what pitch is normally thrown on the 0-0 count. If the hitters have a good idea of what pitch on average is thrown on a 0-0 count, then they might be more likely to swing on the first pitch.

Exploratory methods are used to investigate which variables influence the outcome of the first pitch. These variables may include if there are runners on any of the bases, the inning, the pitch type and location, whether this is the first plate appearance against the pitcher or if the hitter has already had an at-bat against this pitcher, a hitter’s batting average, the hitter and pitcher , etc.

Lastly, the methodological design will include logistic models. A logistic model is a statistical model where the response or dependent variable is categorical and there are one or more independent variables. Logistic regression models for all players are created for all pitches and for first pitches. Also, four logistic regression models are created for certain players. For these logistic regression models, the response variable is binary: “0” if the hitter doesn’t swing on the first pitch or “1” if the hitter swings on the first pitch. Numerous inputs are explored, such as location of pitch, inning, runners on-base, and batting order, to help predict whether the hitter will swing or not at the first pitch.
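To make the idea of a binary-response logistic model concrete, the following is a minimal sketch in plain Python. It is not the model fitted in this research: the data are synthetic, the two predictors (distance from the center of the strike zone and a runners-on-base indicator) are stand-ins, the generating coefficients are arbitrary, and the fit uses simple gradient descent rather than a statistical package.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=1000):
    """Fit weights (intercept first) by batch gradient descent on the log-loss."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = sigmoid(z) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict_swing(w, x) -> float:
    """Predicted probability of a swing (response = 1) for predictors x."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


# Synthetic, made-up data: swings become less likely as the pitch moves
# farther from the center of the zone (arbitrary generating coefficients).
random.seed(0)
X, y = [], []
for _ in range(500):
    dist = random.uniform(0.0, 2.5)   # hypothetical distance in feet
    runners = random.randint(0, 1)    # hypothetical runners-on-base flag
    p_true = sigmoid(1.0 - 2.0 * dist + 0.3 * runners)
    X.append([dist, runners])
    y.append(1 if random.random() < p_true else 0)

w = fit_logistic(X, y)
near = predict_swing(w, [0.2, 0])
far = predict_swing(w, [2.0, 0])
print(w, near, far)
```

In this toy example the fitted coefficient on distance is negative, so the predicted swing probability is higher for a pitch near the center of the zone than for one far from it.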


SECTION 4: EXPLORATORY WORK OF THE FIRST PITCH

4.1 Importance of the First Pitch Example

The 2016 MLB World Series was played by the Chicago Cubs and the Cleveland Indians. The Indians had not been to the World Series since 1997 and hadn’t won since 1948. The Cubs hadn’t made an appearance in the World Series since 1945 and their last World Series win was in 1908. This was an important series for both teams. The Indians won the first game and lost the second. Then, the Indians went on to win two games straight and led the series 3-1. However, the Cubs won the next three games, allowing them to win the 2016 World Series.

The first two games of the 2016 MLB World Series were very different in terms of pitching for the Cleveland Indians. In game one, Corey Kluber from the Indians started and was able to pitch six innings. Within those six innings, he faced twenty-two hitters, threw eighty-eight pitches, gave up four hits, and allowed no earned runs. He threw a first pitch strike to 68% of those hitters. In game two, Trevor Bauer was the starting pitcher for the Indians. Bauer faced eighteen hitters in only 3.2 innings, threw a total of eighty-seven pitches, gave up six hits, and allowed two earned runs. Of those hitters faced, only 55% of the first pitches were strikes.

Figures 4.1 and 4.2 show the pitch location and the types of pitches thrown on the first pitch for both Kluber and Bauer. From Table 4.1, Kluber tended to throw a first pitch curveball or sinker. Bauer tended to throw a four-seam fastball or a curveball on the first pitch. The majority of Kluber’s first pitches fall within the strike zone, while only about half of Bauer’s first pitches were thrown for strikes.


Figure 4.1 Corey Kluber’s Location of First Pitches for the First Game of the 2016 World Series

Figure 4.1 displays Corey Kluber’s first pitch type and location in the first game of the 2016 World Series. The majority of Kluber’s first pitches were inside the strike zone. CU is a curveball, FF is a four-seam fastball, SI is a sinker and SL is a slider. Kluber tended to throw a curveball or a sinker on the first pitch.

Figure 4.2 Trevor Bauer’s Location of First Pitches for the Second Game of the 2016 World Series

Figure 4.2 displays Trevor Bauer’s first pitch type and location in the second game of the 2016 World Series. Compared to Kluber, Bauer was less likely to throw a first pitch in the strike zone. CH is a changeup, CU is a curveball, FC is a cutter, FF is a four-seam fastball, and FT is a two-seam fastball. Bauer tended to throw a four-seam fastball or a curveball on the first pitch.


Table 4.1 First Pitch Type Percentage by Indians Pitcher in the 2016 World Series Games 1 and 2

                Pitch Type
Pitcher         CU     FF     SI     SL     CH     FC     FT
Corey Kluber    50     4.5    31.8   13.8   0      0      0
Trevor Bauer    27.8   50     0      0      5.6    11.1   5.6

Table 4.1 compares the pitch type percentage for first pitches between Corey Kluber and Trevor Bauer in the first and second games of the 2016 World Series. For Kluber, 50% of his first pitches are curveballs, while 50% of Bauer’s first pitches are four-seam fastballs. Kluber didn’t throw a changeup or cutter on his first pitches while Bauer did about 16.7% of the time. Bauer didn’t throw a sinker or slider on his first pitches while Kluber did about 45.6% of the time.

The first pitch strikes thrown by Kluber allowed him to go more innings. Bauer threw about the same number of pitches and faced about the same number of hitters, but was only able to go a little more than three innings. First pitch strikes are crucial for pitchers and for the team. For the first game, Kluber had zero earned runs, four hits, zero walks, and six innings pitched. This equates to a WHIP (Walks plus Hits per Inning Pitched) of 0.667. In contrast, Bauer had two earned runs, six hits, two walks and 3.2 innings pitched, for a WHIP of 2.5. Overall, pitchers want a lower WHIP because this statistic measures how often the pitcher allowed baserunners, either through a hit or a walk, for the innings that they pitched. Kluber had fewer earned runs and a lower WHIP compared to Bauer.
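The WHIP values quoted above can be reproduced with a short calculation, sketched here in Python with helper names chosen for illustration. One caveat, stated as an assumption: if the 3.2 innings figure is box-score notation for three innings plus two outs, rather than a decimal, the same inputs give a WHIP of roughly 2.18 instead of 2.5.

```python
def innings_to_decimal(ip_notation: str) -> float:
    """Convert box-score innings ('3.2' = 3 innings plus 2 outs) to true innings."""
    whole, _, outs = ip_notation.partition(".")
    return int(whole) + (int(outs) if outs else 0) / 3


def whip(walks: int, hits: int, innings: float) -> float:
    """Walks plus hits per inning pitched."""
    return (walks + hits) / innings


kluber = whip(0, 4, innings_to_decimal("6"))          # (0 + 4) / 6
bauer_decimal = whip(2, 6, 3.2)                        # 3.2 read as a decimal
bauer_outs = whip(2, 6, innings_to_decimal("3.2"))     # 3.2 read as 3 and 2/3 innings

print(round(kluber, 3), round(bauer_decimal, 3), round(bauer_outs, 3))
```

Under either reading of the innings figure, Bauer’s WHIP for the game is well above Kluber’s.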


Table 4.2 2016 World Series Games 1 & 2 Comparison for Indians Pitchers Kluber and Bauer

World Series Game   Player         Hitters Faced   Innings Pitched   Earned Runs   Hits   Walks   WHIP    % First Pitch Strikes Thrown
1 (8/25/16)         Corey Kluber   22              6                 0             4      0       0.667   68
2 (8/26/16)         Trevor Bauer   18              3.2               2             6      2       2.5     55

Table 4.2 compares Indians pitchers Corey Kluber and Trevor Bauer in their starting games of the 2016 World Series on hitters faced, innings pitched, earned runs, hits, walks, WHIP and thrown first pitch strike percentage. Kluber pitched six innings and had a lower WHIP of 0.667 compared to 3.2 innings and a WHIP of 2.5 for Bauer. Kluber also had a higher percentage of first pitch strikes thrown.

Lastly, Kluber and Bauer are compared on thrown first pitch strike percentages, innings pitched, WHIP, and earned runs for the whole 2015 season. From Table 4.3, on average Kluber pitched more innings, had fewer walks, had fewer earned runs, had a lower WHIP and threw more first pitch strikes than Bauer. Bauer did allow, on average, fewer hits per start than Kluber.

Table 4.3 Overall 2015 Seasonal Comparison for Kluber and Bauer

Player         Innings Pitched   Hits   Walks   Earned Runs   WHIP   % First Pitch Strikes Thrown
Corey Kluber   6.8               5.9    1.4     2.68          1.15   63.1
Trevor Bauer   5.6               4.83   2.53    2.76          1.72   59.3

Table 4.3 compares Indians pitchers Corey Kluber and Trevor Bauer for the whole 2015 regular season on average innings pitched, hits, walks, earned runs, WHIP and percentage of first pitch strikes thrown each game. On average, Kluber pitched more innings per game, had more hits, fewer walks, fewer earned runs, a lower WHIP and a higher percentage of first pitch strikes thrown compared to Bauer.

Looking at the relationship between thrown first pitch strike percentage and WHIP shown in Figure 4.3, Kluber had less variability in the percentage of first pitch strikes thrown and in his WHIP compared to Bauer. The variability is shown in Figure 4.3 as the grey shaded area. Looking at Kluber’s graph, there is a clear downward trend in the data. As Kluber’s first pitch strike percentage increased, his WHIP slightly decreased. For Bauer, there also appears to be a slight decrease in WHIP as his first pitch strike percentage increased, but it isn’t as clear as for Kluber. Bauer had some games where he had a low first pitch strike percentage and a really high WHIP. There were also times where Bauer’s WHIP was high and his first pitch strike percentage was also high. This could mean that in those games, Bauer had more walks or hits and/or fewer innings pitched than normal, creating a high WHIP for those games.

Figure 4.3 Overall 2015 Seasonal Percentage of First Pitch Strikes Thrown vs WHIP for Kluber vs Bauer

Figure 4.3 displays the first pitch strike percentage thrown vs the WHIP (Walks plus Hits per Inning Pitched) for Trevor Bauer and Corey Kluber for the 2015 regular season. For Kluber the graph shows a downward trend: as first pitch strike percentage increases, WHIP decreases. Bauer appears to have a somewhat downward trend, but there are some outliers that may affect his relationship between first pitch strike percentage and WHIP.

Lastly, thrown first pitch strike percentage is compared with innings pitched for Kluber and Bauer. For Kluber, there is a clear increasing trend in the data. As Kluber throws more first pitch strikes, he is able to pitch more innings. In comparison, Bauer’s first pitch strike percentage does not appear to have a relationship with the innings pitched. Bauer has some games where his first pitch strike percentage was high, but he pitched fewer than three innings. These six games that are below the shaded grey area in Figure 4.4 are the same games that are above the shaded grey area for WHIP in Figure 4.3. Hitters were getting more hits and scoring more than normal, so Bauer would be less likely to continue pitching.

Figure 4.4 Overall 2015 Seasonal Percentage of First Pitch Strikes Thrown vs Innings Pitched for Kluber vs Bauer

Figure 4.4 displays the first pitch strike percentage thrown vs innings pitched for Trevor Bauer and Corey Kluber for the 2015 regular season. There doesn’t appear to be any relationship between thrown first pitch strike percentage and innings pitched for Bauer. Bauer had some outlier games where he had a high first pitch strike percentage, but pitched fewer than three innings. However, for Kluber, as first pitch strike percentage increases, innings pitched also increases.

Next, looking at Corey Kluber’s thrown first pitch strike percentages for the whole season, is there a relationship between the percentage of first pitch strikes thrown in each game and whether that game was classified as a win, loss, or no decision for Kluber? Figure 4.5 shows all of the starts by Corey Kluber in 2015. The dots represent the percentage of first pitch strikes thrown by Kluber for each game. The grey lines connect the black dots, which are the average first pitch strike percentages for each month for each outcome of the start. Each pitching start is classified as a win, loss, or no decision. A no decision can happen in three circumstances: the score is tied when the pitcher leaves the game; the team is winning when the pitcher exits the game but then gives up the lead, whether the score becomes tied or the team falls behind; or the team is losing when the pitcher leaves the game but then comes back to tie or take the lead. The purple lines represent the overall average percentage of first pitch strikes thrown for win, loss, and no decision.

There are some large differences in the percentage of first pitch strikes thrown for games in the same month. For example, in August there were percentages that were in the 40’s and then also some percentages in the 70’s across all pitching outcomes. For a loss outcome, there appears to be a decrease in the percentage of first pitch strikes thrown from the middle of the season to the end. Whereas, when the start outcome was classified as a win, the percentage of first pitch strikes thrown increased from July to October.

Table 4.4 2015 Kluber Percentage of First Pitch Strikes Thrown by Start Outcome

Loss     No Decision   Win
56.063   63.286        61.667

Table 4.4 captures Kluber’s average first pitch strike percentages thrown for each start outcome. Each game is classified as a win, loss, or no decision. A win/loss occurs if the team wins/loses the game. A no decision can happen in three circumstances: 1) the score is tied when the pitcher leaves the game, 2) the team is winning when the pitcher exits the game but then gives up the lead, whether the score becomes tied or the team falls behind, and 3) the team is losing when the pitcher leaves the game but then comes back to tie or take the lead.


Figure 4.5 2015 Kluber First Pitch Strike Percentage by Month and Start Outcome

Figure 4.5 compares the three start outcomes by month and first pitch strike percentage for Corey Kluber. There are some large differences in the percentage of first pitch strikes thrown for games in the same month and the same start outcome. When the start outcome was classified as a win, the percentage of first pitch strikes thrown increased from July to October compared to a decrease when the start outcome was a loss.

Win vs loss is not a good indicator of a successful pitching day. Instead, earned runs, WHIP, and innings pitched are used. Figure 4.6 shows that there doesn’t appear to be a significant relationship between thrown first pitch strike percentage and earned runs. However, looking at WHIP and innings pitched, there appears to be a relationship. As the percentage of first pitch strikes thrown increases, WHIP decreases. This makes sense because WHIP measures the walks plus hits per inning pitched. The more strikes that are thrown on the first pitch, the less likely hitters are to get on-base, creating a lower WHIP for the pitcher. Kluber was able to go more innings on average when he had a higher first pitch strike percentage.

How does the relationship between thrown first pitch strike percentage and WHIP, innings pitched and earned runs compare for all pitchers? Corey Kluber was previously looked at, but are the patterns for Kluber true for all pitchers? Kluber is a starting pitcher for the Cleveland Indians, so only the starting pitchers in the 2015 season were considered in this analysis. For starting pitchers to be eligible to get credit for a win, they have to pitch at least five innings (Lipszyc). Therefore, Figure 4.7 only displays pitcher outings where at least five innings were pitched. As with Kluber, there doesn’t appear to be a significant relationship between thrown first pitch strike percentage and earned runs. Looking at WHIP per game, as thrown first pitch strike percentage increases the WHIP per game slightly decreases. Lastly, there appears to be a positive relationship between innings pitched and percentage of first pitch strikes thrown. As first pitch strike percentage increases, the innings pitched per game increases. Overall, the patterns seen for Kluber are similar to those for all starting pitchers in 2015.

Figure 4.6 Overall 2015 Kluber Percentage of First Pitch Strikes vs Earned Runs, WHIP, and Innings Pitched

Figure 4.6 relates thrown first pitch strike percentage to earned runs, WHIP, and innings pitched for Corey Kluber for the whole 2015 season. The more strikes Kluber threw on the first pitch, the less likely hitters were to get on-base, creating a lower WHIP. With fewer runners on-base, there is a lower chance of earned runs occurring. Lastly, the higher Kluber’s thrown first pitch strike percentage, the more innings Kluber was likely to pitch.


Figure 4.7 2015 Percentage of First Pitch Strikes vs Earned Runs, WHIP, and Innings Pitched for All Starting Pitchers

Figure 4.7 relates thrown first pitch strike percentage against earned runs, WHIP, and innings pitched for all starting pitchers with at least five innings pitched. Patterns are similar to those presented in Figure 4.6.

Overall, good pitchers throw first pitch strikes. Pitchers who struggle to throw a first pitch strike usually walk more hitters and have fewer strikeouts. Figures 4.8 and 4.9 show the percentage of first pitch strikes thrown against the pitcher’s walk percentage and strikeout percentage. This data is for all pitchers who faced at least 20 hitters over the whole 2015 season. Pitchers who faced fewer than 20 hitters didn’t play very often in the season and could skew the data if they were included in the analysis. The best fit line is then added to each of the graphs. For walk percentage, every one percentage point increase in first pitch strikes thrown corresponds to a decrease of 0.21 percentage points in walks. Therefore, the percentage of first pitch strikes thrown and walk percentage are negatively correlated. For strikeouts, every one percentage point increase in first pitch strikes thrown corresponds to a 0.19 percentage point increase in strikeouts. Therefore, as the percentage of first pitch strikes thrown increases, the strikeout percentage increases and the walk percentage decreases.
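The best fit lines mentioned above are ordinary least squares fits. A minimal sketch of how such a line can be computed is given below; the five data points are made up for illustration and are not the 2015 pitcher data, and the function name is an assumption.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical pitchers: first pitch strike percentage and walk percentage.
fps_pct = [50, 55, 60, 65, 70]
walk_pct = [9.2, 8.3, 7.0, 6.1, 4.9]
intercept, slope = fit_line(fps_pct, walk_pct)
print(round(intercept, 2), round(slope, 3))
```

A negative slope, as in this made-up example, corresponds to the negative correlation described in the text.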

Figure 4.8 Percentage of First Pitch Strikes Thrown vs Walk Percentage

Figure 4.8 shows the relationship between thrown first pitch strike percentage and walk percentage for all pitchers during the 2015 season with more than 20 hitters faced. The equation of the best fit line is Walk Percentage = 19.91 - 0.2136*(First Pitch Strike Percentage). As the first pitch strike percentage increases by one percentage point, walk percentage decreases by about 0.21 percentage points.

Figure 4.9 Percentage of First Pitch Strikes Thrown vs Strikeout Percentage

Figure 4.9 shows the relationship between thrown first pitch strike percentage and strikeout percentage for all pitchers during the 2015 season with more than 20 hitters faced. The best fit line shows that there is a positive relationship between thrown first pitch strike percentage and strikeout percentage. The equation for the best fit line is Strikeout Percentage = 9.19 + 0.1905*(First Pitch Strike Percentage). As the first pitch strike percentage increases by one percentage point, the strikeout percentage increases by about 0.19 percentage points.
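The two fitted equations reported in Figures 4.8 and 4.9 can be evaluated directly. The sketch below plugs in a first pitch strike percentage of 60, a value chosen only for illustration; the function names are assumptions.

```python
def predicted_walk_pct(first_pitch_strike_pct: float) -> float:
    """Best fit line from Figure 4.8."""
    return 19.91 - 0.2136 * first_pitch_strike_pct


def predicted_strikeout_pct(first_pitch_strike_pct: float) -> float:
    """Best fit line from Figure 4.9."""
    return 9.19 + 0.1905 * first_pitch_strike_pct


walk_at_60 = predicted_walk_pct(60)
strikeout_at_60 = predicted_strikeout_pct(60)
print(round(walk_at_60, 2), round(strikeout_at_60, 2))
```

These are predictions from simple linear fits, so they are only meaningful within the range of first pitch strike percentages observed in the data.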


4.2 Characteristics of the First Pitch

4.2.1 Percentage of First Pitch Strikes Thrown and Swing Rate of First Pitches

On average, pitchers throw about four pitches to each hitter during the plate appearance. Each pitch can be a called ball, called strike, swinging strike, or put-in-play by the hitter. In play refers to when the hitter makes contact with the ball and the ball is not foul. In other words, the ball is available for the defense to try and get the hitter out. Two outcomes can occur when the ball is put into play. The hitter could be put out by the defense, either through a fly ball that is caught or a groundball that is fielded and thrown to first base before the runner reaches the base. The other option is that the defense is not able to get the hitter out, so the hitter reaches one of the bases and is awarded a hit. A swinging strike includes any foul balls or missed swings. How often do pitchers throw a first pitch strike in general? The pitch sequence for each plate appearance can be found in the Retrosheet data. Using this pitch sequence, pitches can then be classified as balls, called strikes, swinging strikes, or put-in-play.
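The classification described above might be sketched as follows. The single-letter codes are a simplified assumption about the Retrosheet pitch sequence field (B for ball, C for called strike, S for swinging strike, F for foul, X for ball in play); any other character is ignored here, which would mislabel the first pitch of a sequence that starts with an unlisted pitch code. The sequences are hypothetical.

```python
from collections import Counter

# Assumed, simplified mapping of pitch codes to the four classes used in the text.
PITCH_CLASS = {
    "B": "ball",
    "C": "called strike",
    "S": "swinging strike",
    "F": "swinging strike",  # fouls are counted as swinging strikes, as in the text
    "X": "put-in-play",
}


def classify_sequence(sequence):
    """Return the class of each recognised pitch, in order."""
    return [PITCH_CLASS[ch] for ch in sequence if ch in PITCH_CLASS]


def first_pitch_shares(sequences):
    """Percentage of plate appearances by class of the first recognised pitch."""
    counts = Counter()
    for sequence in sequences:
        classes = classify_sequence(sequence)
        if classes:
            counts[classes[0]] += 1
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


# Hypothetical pitch sequences, one per plate appearance.
sequences = ["CBFX", "BBCS", "X", "SFB.X", "CCS"]
shares = first_pitch_shares(sequences)
print(shares)
```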

From Table 4.5, 50% of first pitches were strikes, 39% were balls, and about 11% were put-in-play. Those strikes split into swinging strikes (18% of first pitches) and called strikes (32% of first pitches). The highest percentage of called strikes and called balls occurs on the first pitch. This is due to most hitters not wanting to swing on the first pitch, so it is more likely that first pitches are called balls or strikes. The highest percentage of swinging strikes is on the fifth pitch. Hitters are more likely to put the ball in play on the sixth pitch compared to the other pitches. How often do hitters swing at the first pitch? Using Retrosheet data again, hitters on average swing about 29-30% of the time on the first pitch. A swing is classified as a swinging strike or put-in-play. The later into the count, the more likely the hitter is to swing. The fifth pitch has the highest percentage of hitters swinging.

Table 4.5 Percentage of Event by Pitch Number

Pitch Number    Total Strike   Total Swing   Swinging Strike   Called Strike   Ball   Put-In-Play
First Pitch     50             29            18                32              39     11
Second Pitch    44             46            28                16              38     18
Third Pitch     43             51            31                12              37     20
Fourth Pitch    44             51            33                11              35     21
Fifth Pitch     45             62            37                8               30     25
Sixth Pitch     31             55            26                5               26     29

Table 4.5 presents the event percentages by the number of pitches. Total strike equals swinging strikes plus called strikes. Total swing adds put-in-play and swinging strike. 50% of the time first pitches were strikes, 39% of the first pitches were balls, and about 11% of the time the first pitch was put-in-play. Of all first pitches, 18% were swinging strikes and 32% were called strikes. The highest percentage of called strikes and called balls occurs on the first pitch.

Figure 4.10 Percentage of Event by Pitch Number

Figure 4.10 displays the percentage of each event for each pitch number. For the first pitch, a called ball had the highest percentage, followed by a called strike, a swinging strike, and the ball being put in play. After the first pitch, a called strike had the lowest percentage for the rest of the pitches. A swinging strike had its highest percentage on the fifth pitch, and putting the ball in play had its highest percentage on the sixth pitch.

4.2.2 Pitch Type Classification

The most common pitch types of a pitcher are a fastball, a change-up, a curveball, and a slider. The most common fastballs are two-seam fastballs and four-seam fastballs. The difference between these two pitches is how the pitcher grips the ball. This grip can then influence movement of the pitch (to see the definition of all pitch types, refer to Appendix A). The biggest difference between the pitch types is the speed and movement. The movement can occur either vertically or horizontally. Knowing the speed and movement of the pitch will allow one to classify the pitch into different categories and that is how the pitch type is obtained. For the description of the Pitchf/x variables used for pitch classification, refer to Appendix B.

2015 data for Corey Kluber can be used to see an example of pitch classification. All of Kluber’s pitches thrown in the 2015 season can be found using the Pitchf/x data. By means of three variables (start_speed, break_angle, and break_length), all of the pitches thrown by Kluber in 2015 can be plotted.

Using model-based clustering as described by Brian Mills in his article called “Pitch Classification with Mclust”, the clusters can be mapped to their corresponding pitch type. Kluber’s curveballs (red points) and change-ups (blue points) have similar start speeds. From Figure 4.11, the red points have the largest break length. This makes sense since the red points are curveballs. Four-seam fastballs (green points) and sinkers (yellow points) also have similar speeds, but sinkers have a slightly larger break angle and break length. Lastly, the purple points are the sliders. The sliders have a start speed that is slightly slower than the fastballs and sinkers and have a break angle that is somewhat similar to the curveball.
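To give a feel for what model-based clustering does, the sketch below fits a two-component Gaussian mixture to start speed alone using the EM algorithm. This is a deliberately reduced stand-in written for illustration: it is not the Mclust procedure used for Figure 4.11, it uses one variable instead of three, and the ten speeds are made up.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_component_mixture(xs, iterations=50, min_var=0.05):
    """Fit a two-component 1-D Gaussian mixture with EM; return means, variances, labels."""
    overall_mean = sum(xs) / len(xs)
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / len(xs)
    means = [min(xs), max(xs)]            # start the components at the extremes
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each pitch.
        resp = []
        for x in xs:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and variances from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, min_var)  # floor keeps a component from collapsing
            weights[k] = nk / len(xs)
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return means, variances, labels


# Hypothetical start speeds (mph): five slower pitches and five faster pitches.
speeds = [81.9, 82.4, 82.7, 83.1, 83.5, 91.8, 92.3, 92.7, 93.0, 93.6]
means, variances, labels = fit_two_component_mixture(speeds)
print([round(m, 2) for m in means], labels)
```

With real Pitchf/x data, break angle and break length would be needed as well, since pitches such as four-seam fastballs and sinkers have nearly identical start speeds.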


Figure 4.11 Pitch Type Classification Example by Start Speed, Break Angle, and Break Length for Corey Kluber

Table 4.6 Pitch Type Classification Example by Start Speed, Break Angle, and Break Length for Corey Kluber

Cluster             1           2           3                    4        5
Color               Blue        Red         Green                Purple   Yellow
Mean Start Speed    84.5        82.7        92.7                 88.4     92.8
Pitch Type          Change-up   Curveball   Four-seam Fastball   Slider   Sinker

Figure 4.11 and Table 4.6 exhibit how pitches are classified given their start speed, break angle, and break length for Corey Kluber. Kluber’s four-seam fastballs (green) and sinkers (yellow) have the highest average start speed. Kluber’s curveball (red) has the highest average break angle and length. The change-ups (blue) that Kluber throws have a lower start speed than the four-seam fastball, slider, or sinker and have little break length or break angle. The sliders have a start speed that is slightly slower than the fastballs and sinkers and have a break angle that is somewhat similar to the curveball.


Pitchers vary in the types of pitches they throw. Some pitchers may not have a curveball and have a slider instead, and some pitchers might throw different types of curveballs. There are also some differences in the same pitch among different pitchers. “Each pitcher has a slightly different grip and arm action for their pitches, so the same pitches can technically look quite different depending on the pitcher” (“Pitch”). The grip of the ball can have a big effect on the spin and/or speed. For a changeup to be successful, the pitcher must throw the pitch with the same arm speed and arm angle as the fastball. The only difference is that the ball is gripped differently, resulting in a slower speed. If the pitcher throws underhand or sidearm, then they may have a different spin creating different movement for the ball. “Throwing sidearm definitely limits the variety of pitches you can throw. Basically, most sidearmers throw only a fastball and a slider” (Walsh, “Pitch”). Therefore, the types of pitches thrown by pitchers may be the same, but how they are executed or how they move might be different for each pitcher.

4.2.3 Pitch Type Thrown on the First Pitch

Table 4.7 shows the pitch type percentages for all pitches in the 2015 season. As expected, fastballs in general are thrown about 49% of the time, with 36% being four-seam fastballs and 13% being two-seam fastballs. After the fastballs, the slider is the most likely pitch thrown to hitters, followed by the changeup. The top three pitches (four-seam fastball, slider, and two-seam fastball) are all fast pitches (about 80 mph or greater) and account for 64% of the pitches thrown to hitters.


Table 4.7 Percentage of Overall Pitch Type

Pitch Type            Percentage
Four-seam Fastball    36
Two-seam Fastball     13
Slider                15
Change-up             11
Curveball             8
Sinker                8
Cutter                6
Splitter              2
Knuckle Curve         2
Knuckleball           1

Table 4.7 showcases the overall percentage of pitches thrown by pitch type. Four-seam fastballs and sliders make up about 51% of the pitches thrown in general. Two-seam fastballs and change-ups make up the next 24% of pitches thrown.

Table 4.7 showed the overall pitch type thrown. But what about the first pitch? Knowing the percentage that a certain pitch is thrown on the first pitch can give an advantage to the hitter. If the first pitch thrown is normally a fastball, then the hitter might be more likely to swing on the first pitch since a fastball might be easier to hit than a curveball or a change-up.

From Table 4.8, the top three pitches thrown on the first pitch on average are the four-seam fastball, the two-seam fastball, and the slider. Combined, these three pitches make up about 65% of first pitches (38% + 14% + 13% in Table 4.8). This indicates that normally a fast pitch is thrown on the first pitch. The four-seam fastball, slider, and two-seam fastball are all around 80-90 mph. The main difference between these pitches is the movement. It might be easier for the hitter to swing on a fast pitch than on a curveball or change-up. Next, looking at all non-first pitches, the top two pitches thrown are the four-seam fastball and the slider. There is a tie between the two-seam fastball and the change-up for the third most common pitch thrown for non-first pitches. Change-ups are not as common on first pitches, but they are more likely to occur after the first pitch. Knowing that most of the time a non-off-speed pitch is thrown on the first pitch can allow the hitter to be better prepared to swing on the first pitch.

Table 4.8 Percentage of Pitch Type Thrown on First Pitches and Non-First Pitches

Pitch Type            First Pitches   Non-First Pitches
Four-seam Fastball    38              35
Two-seam Fastball     14              12
Slider                13              15
Sinker                9               8
Change-up             8               12
Curveball             8               7
Cutter                5               6
Knuckle Curve         2               2
Splitter              1               2
Knuckleball           1               1

Table 4.8 compares the percentages of first pitches and non-first pitches thrown by pitch type. Four-seam fastballs are the most likely pitch thrown for first and non-first pitches. For first pitches, two-seam fastballs are the second most likely pitch, whereas for non-first pitches sliders have the second highest percentage.
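The ranking of first-pitch types can be read off Table 4.8 programmatically. The sketch below copies the rounded first-pitch column of the table and sums the three largest entries; because the inputs are rounded percentages, the total is only approximate. The variable names are illustrative.

```python
# First-pitch percentages copied from Table 4.8 (rounded values).
first_pitch_pct = {
    "Four-seam Fastball": 38,
    "Two-seam Fastball": 14,
    "Slider": 13,
    "Sinker": 9,
    "Change-up": 8,
    "Curveball": 8,
    "Cutter": 5,
    "Knuckle Curve": 2,
    "Splitter": 1,
    "Knuckleball": 1,
}

# Sort pitch types from most to least common and keep the top three.
ranked = sorted(first_pitch_pct.items(), key=lambda item: item[1], reverse=True)
top_three = [name for name, _ in ranked[:3]]
top_three_share = sum(pct for _, pct in ranked[:3])
print(top_three, top_three_share)
```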

Does the count type influence the pitch type? Table 4.9 looks at the pitch type thrown for each count type. Recall that the count favors the hitter when the count is 1-0, 2-0, 2-1, 3-0, 3-1, or 3-2 and favors the pitcher when the count is 0-2, 1-2, 2-2, or 0-1. The neutral counts are 0-0 (first pitch) or 1-1. When the count favors the hitter, the pitch is more likely to be a four-seam or two-seam fastball. Sliders and curveballs are more common when the count favors the pitcher.

When the count favors the pitcher, the pitcher is looking to get the hitter to swing on non-strike pitches or pitches they normally wouldn’t swing on. Throwing a slider or curveball will make the hitter try to hit a ball with lots of movement. When the pitcher is behind in the count, on the other hand, the pitcher is looking to throw a strike to hopefully get back even with the hitter. The location of a fastball is easier to control, so pitchers are more likely to throw fastballs when they are behind.


Table 4.9 Percentage of Pitch Type Thrown by Count Type

                      Count Type
Pitch Type            Favors Hitter   Favors Pitcher   Neutral
Four-seam Fastball    40              32               37
Two-seam Fastball     15              10               14
Slider                11              18               13
Sinker                10              6                9
Change-up             11              11               10
Curveball             3               10               8
Cutter                6               5                6
Knuckle Curve         1               3                2
Splitter              1               2                1
Knuckleball           0               1                1

Table 4.9 looks at the percentage of pitch type by count type. A hitter’s count takes place when the count is 1-0, 2-0, 2-1, 3-0, 3-1, or 3-2. A pitcher’s count occurs when the count is 0-2, 1-2, 2-2, or 0-1. Lastly, neutral counts are 0-0 or 1-1. The four-seam fastball is the top pitch for each count type, but is higher for hitter’s counts than either of the other two count types. Sliders and curveballs are more common when the count favors the pitcher.
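The count type definition used throughout this section can be written as a small lookup. This is only a sketch of the grouping stated in the text; the function and constant names are assumptions.

```python
# Counts are written as (balls, strikes), following the grouping given in the text.
HITTER_COUNTS = {(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)}
PITCHER_COUNTS = {(0, 1), (0, 2), (1, 2), (2, 2)}
NEUTRAL_COUNTS = {(0, 0), (1, 1)}


def count_type(balls: int, strikes: int) -> str:
    """Return 'Favors Hitter', 'Favors Pitcher', or 'Neutral' for a count."""
    count = (balls, strikes)
    if count in HITTER_COUNTS:
        return "Favors Hitter"
    if count in PITCHER_COUNTS:
        return "Favors Pitcher"
    if count in NEUTRAL_COUNTS:
        return "Neutral"
    raise ValueError(f"not a valid count: {balls}-{strikes}")


print(count_type(0, 0), count_type(3, 1), count_type(0, 2))
```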

How does the stance of the hitter and the hand that the pitcher throws with affect the first pitch that is thrown? In the labels used in Table 4.10, the first letter represents the hand that the pitcher throws with (L=left and R=right) and the second letter is the side that the hitter stands on. If the side is L, which is left, then the hitter hits left-handed and stands on the left side of the plate from the pitcher’s perspective. For example, L-L means that the pitcher throws left-handed and the hitter is left-handed.

From Table 4.10, the top pitch thrown on the first pitch regardless of the hand of the pitcher or the stance of the hitter is the four-seam fastball. The next most frequent pitch depends on the hand of the pitcher and the stance of the hitter. The slider is the second most common pitch when the pitcher’s hand matches the stance of the hitter (either right-handed pitcher and right-handed hitter, R-R, or left-handed pitcher and left-handed hitter, L-L). When the pitcher throws left-handed and the hitter is hitting on the right side (L-R) or when the pitcher throws right-handed and the hitter is on the left side (R-L), then the second most common pitch thrown on the first pitch is a two-seam fastball.

Non-first pitches are considered next. Are there any similarities or differences in the pitch type thrown for the stance of the hitter and hand of the pitcher compared to the first pitch? Again, the top pitch thrown for non-first pitches regardless of stance of the hitter or hand of the pitcher is the four-seam fastball. Similarly, when the stance of the hitter and the hand of the pitcher match (L-L or R-R), then the second most common pitch is the slider. The major difference between first pitches and non-first pitches is the second most common pitch for non-matching hand and stance (L-R or R-L). For first pitches, the second most common pitch was the two-seam fastball, but for non-first pitches it was the change-up.

This information can help prepare the hitter for what they are likely to see when they step up to the plate. If the hitter is ready for a fast pitch, then they can swing if it is a fast pitch and lay off the pitch if it is off-speed. As a hitter, it is important to try to predict the pitch beforehand and to recognize it as soon as it is thrown, so that there is enough time to swing if the hitter chooses to do so.


Table 4.10 Percentage Pitch Type Thrown on First Pitches and Non-First Pitches by Hand of Pitcher and Stance of Hitter

                      First Pitches               Non-First Pitches
Pitch Type            L-L    L-R    R-L    R-R    L-L    L-R    R-L    R-R
Four-seam Fastball    40*    36*    37*    40*    35*    33*    35*    36*
Slider                20**   8      7      19**   22**   12     11     20**
Two-seam Fastball     13     19**   16**   11     13     15     12     11
Sinker                9      10     9      9      9      8      7      8
Change-up             2      14     11     3      4      18**   15**   7
Curveball             8      7      9      7      9      7      7      7
Cutter                5      3      5      7      5      4      6      7
Knuckle Curve         2      2      3      2      0      0      1      1
Splitter              0      0      2      1      1      1      3      2
Knuckleball           0      0      1      1      0      0      1      1

Table 4.10 compares first pitches vs non-first pitches for the pitch type and by the hand of the pitcher and stance of the hitter. L-L represents that the pitcher is throwing left-handed and the hitter is hitting left-handed, which means that they would stand on the left of the plate from the pitcher’s perspective. The top pitch thrown for each situation is marked with * and the second top pitch is marked with **. The top pitch regardless of hand of the pitcher and stance of the hitter is the four-seam fastball. The second most likely pitch depends on the hand of the pitcher, the stance of the hitter, and if the pitch is a first pitch or not.
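The second most common pitch in each matchup can be recovered from the entries of Table 4.10. The sketch below copies the table values and ranks the pitch types within each column; the column labels and variable names are chosen for illustration.

```python
COLUMNS = [
    "First L-L", "First L-R", "First R-L", "First R-R",
    "Non-first L-L", "Non-first L-R", "Non-first R-L", "Non-first R-R",
]

# Percentages copied from Table 4.10, one list entry per column above.
TABLE_4_10 = {
    "Four-seam Fastball": [40, 36, 37, 40, 35, 33, 35, 36],
    "Slider":             [20, 8, 7, 19, 22, 12, 11, 20],
    "Two-seam Fastball":  [13, 19, 16, 11, 13, 15, 12, 11],
    "Sinker":             [9, 10, 9, 9, 9, 8, 7, 8],
    "Change-up":          [2, 14, 11, 3, 4, 18, 15, 7],
    "Curveball":          [8, 7, 9, 7, 9, 7, 7, 7],
    "Cutter":             [5, 3, 5, 7, 5, 4, 6, 7],
    "Knuckle Curve":      [2, 2, 3, 2, 0, 0, 1, 1],
    "Splitter":           [0, 0, 2, 1, 1, 1, 3, 2],
    "Knuckleball":        [0, 0, 1, 1, 0, 0, 1, 1],
}


def top_two(column_index):
    """Return the most common and second most common pitch type in one column."""
    ranked = sorted(TABLE_4_10, key=lambda pitch: TABLE_4_10[pitch][column_index], reverse=True)
    return ranked[0], ranked[1]


second_most_common = {name: top_two(i)[1] for i, name in enumerate(COLUMNS)}
for name, pitch in second_most_common.items():
    print(name, pitch)
```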


4.2.4 Percentage of Strikeouts and Walks for Plate Appearances Passing Through 0-1 vs 1-0 Count

As described before, the first pitch can lead to a 1-0 count (a ball), 0-1 count (a swinging strike or called strike), or the hitter can put the ball in play. What is the result of the plate appearance if the count passes through 0-1 vs 1-0? How does passing through these counts change the chance of a strikeout versus a walk? In general, for all counts, there is a 20% chance the plate appearance will end up with a strikeout, 9% chance the hitter will be walked, and 71% chance the ball is put in play. When the first pitch is a strike, there is a higher chance of the plate appearance leading to a strikeout. When the first pitch is a ball, the percentage of walks increases.

Table 4.11 Percentage of Strikeouts and Walks for Plate Appearances Passing Through 0-1 vs 1-0 Count

                                % Strikeouts   % Walks   % Other Events
Overall                         20             9         71
When first pitch is a strike    28             5         66
When first pitch is a ball      16             15        69

Table 4.11 looks at plate appearances that pass through the count 0-1 or 1-0 and compares the outcome of those plate appearances. When the pitcher can throw a first pitch strike, the percentage of strikeouts increases. If a first pitch ball is thrown, a walk is more likely to occur.
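A tabulation like Table 4.11 can be produced by grouping plate appearances by the result of the first pitch. The sketch below assumes each plate appearance has already been reduced to a (first pitch, outcome) pair; the eight records are hypothetical and only show the mechanics.

```python
from collections import Counter

OUTCOMES = ["strikeout", "walk", "other"]


def outcome_shares(records, first_pitch):
    """Percentage of each outcome among plate appearances with the given first pitch."""
    counts = Counter(outcome for pitch, outcome in records if pitch == first_pitch)
    total = sum(counts.values())
    return {outcome: 100 * counts[outcome] / total for outcome in OUTCOMES}


# Hypothetical plate appearances: (result of first pitch, outcome of plate appearance).
records = [
    ("strike", "strikeout"), ("strike", "other"), ("strike", "other"), ("strike", "strikeout"),
    ("ball", "walk"), ("ball", "other"), ("ball", "other"), ("ball", "strikeout"),
]
after_strike = outcome_shares(records, "strike")
after_ball = outcome_shares(records, "ball")
print(after_strike, after_ball)
```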

4.3 Pitch Location by Stance and Count

Where are first pitches normally located by the stance of the hitter? Is there any difference in location for first pitches vs non-first pitches? Looking at Table 4.12, the majority of first pitches are thrown in the strike zone regardless of the stance of the hitter. Also, the percentages of pitches thrown low and high are relatively similar for left-handed and right-handed hitters. There is a higher percentage of pitches thrown inside for right-handed hitters than left-handed hitters. For left-handed hitters, there is a higher percentage of first pitches thrown outside than for right-handed hitters. Non-first pitches are less likely to be in the zone compared to first pitches. Overall, non-first pitches are more likely to be low, high, and inside compared to first pitches regardless of stance. Non-first pitches are less likely to be outside than first pitches.

Table 4.12 Location of First Pitches and Non-First Pitches by Stance of Hitter

                  First Pitches    Non-First Pitches
Location          L       R        L       R
In strike zone    51.6    55.8     47.2    49.0
Low               19.3    19.4     23.8    25.2
High              8.6     8.0      9.4     8.2
Inside            3.8     6.2      5.0     7.2
Outside           16.6    10.6     14.6    10.4

Table 4.12 is the location of first pitches and non-first pitches by the stance of the hitter. The majority of first pitches are thrown in the strike zone regardless of the stance of the hitter. Non-first pitches are less likely to be in the strike zone compared to first pitches and more likely to be low, high, or inside.

The stance of the hitter and the hand that the pitcher throws with affect the location of the different pitches thrown. Four-seam fastballs are normally thrown away from the hitter regardless of the hand of the pitcher. Two-seam fastballs usually are thrown low in the strike zone. The next most common pitch used is a slider. Sliders usually occur when the hand of the pitcher matches the stance of the hitter (i.e. a left-handed pitcher and a left-handed hitter or a right-handed pitcher and a right-handed hitter). The sliders are usually thrown away from the hitter and low in the strike zone. The curveball’s location is similar to the slider. For the changeup, left-handed pitchers usually only throw this pitch to a right-handed hitter and a right-handed pitcher usually only throws it to a left-handed hitter. “Unlike the slider, the changeup doesn’t break into an opposite-handed hitter’s swing, which helps explain why it’s generally a much more popular off-speed option in those situations” (Goldsberry).

Figure 4.12 is a contour plot in a two-dimensional space. A contour plot allows for a representation of the distribution of the data. Consider looking at the plot from a bird’s eye view. Think of the circles as being stacked on top of each other to create a pyramid-like shape. The wide circles on the outside are pitch locations that don’t occur very often. The small circles in the very center show where most of the pitches are thrown on each count.

From Figure 4.12, the count has a large effect on where the next pitch is likely to be thrown. For example, looking at the 0-2 count, there is a wide variety of locations for the next pitch. The 0-2 count is when there are zero balls and two strikes on the hitter. The pitcher is likely to throw a “waste” pitch in hopes of getting the hitter to swing at a pitch outside of the strike zone. The pitcher hopes to get the hitter to strike out or to produce a weak hit that makes for an easy out for the defense. The 1-2 count follows a similar pattern. In contrast, looking at the 3-1 count, where there are three balls and one strike on the hitter, the majority of the pitches thrown are within the strike zone. With three balls on a hitter, the pitcher is trying to avoid walking the hitter, so they are more likely to throw a strike. The 0-0, 0-1, 1-0, 1-1, 2-1, and 2-2 counts all have structures that are similar to each other. Most pitches are within the strike zone, but a good proportion are outside the strike zone.

Figure 4.12 Overall Pitch Location by Count for All Pitches

Figure 4.12 illustrates the location of the pitch by the count for all pitches by using a contour graph. When the pitcher is able to get two strikes on the hitter, there is a wide variety of locations for the next pitch. In contrast, if the pitcher has three balls on the hitter, then the pitch is more likely to be in the strike zone.

The contour graphs are good visual displays, but they make it hard to quantify the location of the pitch. Another way to look at pitch location is to break the location into inside and outside of the strike zone. A pitch is outside of the strike zone when it is thrown too far inside, too far outside, too high, or too low to be called a strike. The highest percentages of pitches thrown inside the strike zone occur on the 3-0, 3-1, and 3-2 counts. The pitcher is trying to avoid walking the hitter, so they are more likely to throw a strike on these counts. The lowest percentage of pitches thrown in the strike zone occurs when there are two strikes on the hitter and fewer than three balls. In these situations, the pitcher has the advantage because they are only one strike away from striking out the hitter and the hitter is at least two balls away from walking. The pitcher can afford to throw outside the strike zone with hopes that the hitter might swing at a pitch outside of the strike zone and miss. High pitches are most likely to occur on the 0-2, 3-0, or 1-2 count. Low pitches are most common when the count is in the pitcher’s favor (Table 4.13). For inside and outside pitches, there doesn’t appear to be much of a count effect, but the highest percentage of inside pitches (8%) occurs on the 0-1 count and the highest percentage of outside pitches (13.9%) occurs on the 0-2 count.
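One way the five location categories might be assigned from Pitchf/x-style coordinates is sketched below. The rule is an assumption rather than the exact definition used for Tables 4.12 to 4.14: height is checked first, the zone is taken to be exactly as wide as the 17-inch plate, and negative horizontal values are assumed to lie on the side of the plate where a right-handed hitter stands.

```python
PLATE_HALF_WIDTH_FT = 17 / 12 / 2  # home plate is 17 inches wide


def pitch_location(px, pz, sz_top, sz_bot, stand):
    """Classify a pitch as 'In strike zone', 'Low', 'High', 'Inside', or 'Outside'.

    px, pz: horizontal and vertical position at the plate, in feet (assumed convention).
    sz_top, sz_bot: top and bottom of the hitter's strike zone, in feet.
    stand: 'L' or 'R', the side the hitter bats from.
    """
    if pz < sz_bot:
        return "Low"
    if pz > sz_top:
        return "High"
    if abs(px) <= PLATE_HALF_WIDTH_FT:
        return "In strike zone"
    # Assumption: px < 0 is the side of the plate where a right-handed hitter stands.
    toward_hitter = (px < 0) if stand == "R" else (px > 0)
    return "Inside" if toward_hitter else "Outside"


print(pitch_location(0.0, 2.5, 3.5, 1.5, "R"))
print(pitch_location(-1.2, 2.5, 3.5, 1.5, "R"))
print(pitch_location(-1.2, 2.5, 3.5, 1.5, "L"))
```

A pitch that misses both vertically and horizontally is labeled by its height under this rule; a different tie-breaking choice would shift some pitches between categories.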

The highest percentage of pitches in the strike zone occurs when the count favors the hitter. Pitchers are trying to come back from being down in the count, so they are more likely to throw a strike. The lowest percentage of pitches in the strike zone occurs when the count is in the pitcher’s favor. Pitchers are ahead in the count and are trying to get hitters to swing at pitches outside of the strike zone. In other words, when the count favors the pitcher, the pitcher is more likely to throw outside of the strike zone compared to the other count types.

Table 4.13 Location Percentage Based on Count and Count Type

                                                   Outside of strike zone
Count Type        Count   Inside strike zone   Low     High    Inside   Outside
Neutral           0-0     54.0                 19.4    8.3     5.1      13.2
                  1-1     50.1                 23.8    6.8     7.5      11.8
Favors Pitcher    0-1     45.0                 26.3    7.5     8.0      13.2
                  0-2     32.6                 32.0    16.1    5.4      13.9
                  1-2     38.0                 32.0    10.7    6.4      13.0
                  2-2     47.1                 26.8    7.9     6.4      11.7
Favors Hitter     1-0     54.9                 20.6    7.1     5.1      12.3
                  2-0     57.5                 16.3    9.8     4.4      12.0
                  2-1     56.5                 20.3    6.4     5.5      11.3
                  3-0     57.7                 11.7    16.2    3.3      11.0
                  3-1     62.1                 16.0    6.6     4.4      10.9
                  3-2     58.5                 19.7    7.4     5.2      9.3

Table 4.13 demonstrates the location percentage by count and count type. Low pitches are most common when the count is in the pitcher’s favor. The highest percentages of pitches thrown inside the strike zone occur when there are three balls on the hitter and the lowest percentages occur during a count that favors the pitcher.


Table 4.14 Location Percentage Based on Count Type

                                       Outside of strike zone
Count Type        In strike zone   Low     High    Inside   Outside
Neutral           52.9             20.6    7.9     5.8      12.8
Favors Pitcher    41.4             28.9    9.9     6.8      13.0
Favors Hitter     56.9             19.1    7.7     5.0      11.4

Table 4.14 summarizes Table 4.13 by just looking at the location of the pitch for the three count types. Pitches are more likely to be low and outside the strike zone when the count favors the pitcher. When the count favors the hitter, there is a 57% chance that the pitch will be in the strike zone.

4.4 Hit Quality

If the hitter does swing on the first pitch and puts the ball into play, what is the chance the ball is a hit or an out? Within the Retrosheet data, there is information on what type of event occurred during each plate appearance. The only events of interest at this point are whether the event was a single, double, triple, home run, or an out. The highest percentage of singles, doubles, triples, and home runs occurs on the first pitch, when the count is 0-0. Outs also have their highest percentage on the first pitch.

Table 4.15 Event Percentages for First and Second Pitches

Count   Single   Double   Triple   Home run   Out
0-0     15.6     16.7     16.5     17.4       18.8
0-1     12.9     11.7     10.3     10.0       13.6
1-0     9.2      10.3     10.0     11.5       10.0

Table 4.15 contrasts the first and second pitches by event percentages. The highest percentage of hitting a single, double, triple or home run occurs on the first pitch.

Table 4.16 shows a hitter versus pitcher advantage. Outs are more likely to occur when the count favors the pitcher (40.3%) versus a neutral count (30.7%) and a hitter’s count (29.1%). Home runs are more likely to be hit in a hitter’s count (39%) versus a pitcher’s count (31.4%) or a neutral count (29.7%). Singles, doubles, and triples are more likely to be hit on a pitcher’s count.

Table 4.16 Event Percentages by Count and Count Type

                            Event
Count Type        Count   Single   Double   Triple   Home run   Out
Neutral           0-0     15.6     16.7     16.5     17.4       18.8
                  1-1     12.0     11.6     11.6     12.3       11.9
Favors Pitcher    0-1     12.9     11.7     10.3     10.0       13.6
                  0-2     6.6      5.7      6.1      4.3        5.7
                  1-2     11.6     9.9      8.6      8.3        9.8
                  2-2     11.0     10.7     11.6     8.8        11.2
Favors Hitter     1-0     9.2      10.3     10.0     11.5       10.0
                  2-0     3.4      3.8      5.0      5.7        3.0
                  2-1     7.0      7.5      7.2      7.2        6.5
                  3-0     0.2      0.4      0.0      0.5        0.1
                  3-1     2.8      3.4      4.5      5.0        2.8
                  3-2     7.6      8.3      8.5      9.1        6.7

                  Event
Count Type        Single   Double   Triple   Home run   Out
Neutral           27.6     28.3     28.1     29.7       30.7
Favors Pitcher    42.1     38       36.6     31.4       40.3
Favors Hitter     30.2     33.7     35.2     39         29.1

Table 4.16 highlights the event percentages by count and then looks broadly at count type in general. Outs are more likely to occur when the count favors the pitcher. All events (singles, doubles, triples, home runs, and outs) have a higher chance of occurring on the first pitch than on any other single count. Home runs in general are more likely to be hit when the count favors the hitter. Singles, doubles, and triples have the highest percentages when the count favors the pitcher.
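The count type summary in the lower part of Table 4.16 is the sum of the per-count rows in the upper part. The sketch below copies the per-count values from the table and adds them within each count type; the variable names are illustrative.

```python
EVENTS = ["Single", "Double", "Triple", "Home run", "Out"]

# Per-count event percentages copied from the upper part of Table 4.16.
BY_COUNT = {
    "Neutral": {
        "0-0": [15.6, 16.7, 16.5, 17.4, 18.8],
        "1-1": [12.0, 11.6, 11.6, 12.3, 11.9],
    },
    "Favors Pitcher": {
        "0-1": [12.9, 11.7, 10.3, 10.0, 13.6],
        "0-2": [6.6, 5.7, 6.1, 4.3, 5.7],
        "1-2": [11.6, 9.9, 8.6, 8.3, 9.8],
        "2-2": [11.0, 10.7, 11.6, 8.8, 11.2],
    },
    "Favors Hitter": {
        "1-0": [9.2, 10.3, 10.0, 11.5, 10.0],
        "2-0": [3.4, 3.8, 5.0, 5.7, 3.0],
        "2-1": [7.0, 7.5, 7.2, 7.2, 6.5],
        "3-0": [0.2, 0.4, 0.0, 0.5, 0.1],
        "3-1": [2.8, 3.4, 4.5, 5.0, 2.8],
        "3-2": [7.6, 8.3, 8.5, 9.1, 6.7],
    },
}

# Sum each event's percentages over the counts belonging to a count type.
totals = {
    count_type: {
        event: round(sum(row[i] for row in counts.values()), 1)
        for i, event in enumerate(EVENTS)
    }
    for count_type, counts in BY_COUNT.items()
}
for count_type, row in totals.items():
    print(count_type, row)
```

Because each column of the upper table sums to roughly 100%, these totals describe how each event is distributed across count types rather than the chance of the event within a count type.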


SECTION 5: FIRST PITCH-HITTER PERSPECTIVE

The previous sections introduced the first pitch and the importance of the count. Now that there is a general understanding of the first pitch, how does the first pitch influence hitters? Do hitters that take a first pitch strike or ball have higher hitting statistics than hitters who swing on the first pitch or put the ball in play? From the hitter’s perspective, what are the benefits or consequences of swinging on the first pitch compared to watching a first pitch strike or ball? Some hitters might swing on the first pitch in the hope that it is a good pitch to hit. Other hitters might tend to be patient at the plate. The popular hitting statistics are batting average, on-base percentage, slugging percentage, and on-base plus slugging. The mean and standard error are given for each statistic by the result of the pitch for first pitches, non-first pitches, and all pitches.

The four outcomes of most interest are a called ball, a called strike, a swinging strike, and a ball put-in-play. Swinging strikes include foul balls and swings where the hitter missed the ball. A ball put-in-play refers to any pitch that the hitter makes contact with and that stays fair, regardless of whether it is recorded as a hit or an out. This comparison will provide some insight, from the hitter’s perspective, into whether it is beneficial to swing on the first pitch. The common abbreviations used within the next two sections can be found in Appendix C.

5.1 Batting Average

The batting average (BA) of a player was one of the first statistics to measure the performance of a player. Just like the game of baseball, this statistic was also adopted from the game of cricket. Chadwick changed the statistic from runs scored divided by outs in cricket to hits divided by at-bats; this still holds true. The batting average for a player is defined as the total number of hits divided by their total at-bats. Recall from Section 1.1 that at-bats are plate appearances removing walks, hit by pitches, sacrifice hits, or awarded first base due to catcher interferences. The batting average has several drawbacks. “For instance, batting average doesn’t take into account the number of times a hitter reaches base via walks or hit by pitches. And it doesn’t take into account hit type (with a double, triple or home run being more valuable than a single)” ("Glossary"). Although batting average has some drawbacks, it is widely used among Sabermetricians and fans.

Table 5.1 Mean BA and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

                      OVERALL             FIRST PITCH         NON-FIRST PITCH
Result of Pitch       Mean      SE        Mean      SE        Mean      SE
Ball                  0.2532    0.0001    0.2535    0.0002    0.2530    0.0001
Called Strike         0.2506    0.0001    0.2519    0.0002    0.2494    0.0002
Swinging Strike       0.2511    0.0001    0.2508    0.0003    0.2511    0.0001
Put in Play           0.2543    0.0001    0.2519    0.0002    0.2547    0.0001

Table 5.1 shows that a ball on the first pitch is most beneficial for a hitter’s batting average and a swinging strike results in the lowest batting average. For all pitches, putting the ball in play produces the highest mean batting average and a called strike is the lowest.

Figure 5.1 Mean BA by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

Figure 5.1 takes Table 5.1 and displays the mean BA by result of pitch for all pitches, first pitches, and non-first pitches. For the first pitch, the highest mean BA occurs when the pitch is a ball. For all pitches and non-first pitches, the highest mean BA is when the ball is put in play.

The overall portion of the table represents all pitches. Overall, the highest mean batting average occurs when the pitch is put into play, followed by a ball. The third highest batting average occurs on a swinging strike, and the lowest on a called strike. Overall, if the pitch is going to be a strike, then it is better for the hitter to swing than to take a called strike. Swinging gives the hitter a chance to put the ball in play, which resulted in the highest overall batting average. Now, looking at just the first pitch, a called ball results in the highest mean batting average. A called strike and putting the ball in play are essentially tied for second highest. Lastly, a swinging strike has the lowest batting average for the first pitch. Therefore, for the first pitch, the hitter has a slightly higher batting average when they don’t swing.

5.2 On-Base Percentage

$$\mathrm{OBP} = \frac{\mathrm{H} + \mathrm{BB} + \mathrm{HBP}}{\mathrm{AB} + \mathrm{BB} + \mathrm{HBP} + \mathrm{SF}}$$

On-base Percentage (OBP) is defined as the number of times a hitter reaches base, whether through a hit, walk, or being hit by a pitch, divided by their plate appearances. As mentioned before in Section 1.1, plate appearances are the sum of the hitter’s at-bats, walks, hit by pitches, and sacrifice flies. According to , a player is seen to have an above average OBP if it is above 0.32. “Players with high on-base percentages avoid making outs and reach base at a high rate, prolonging games and giving their team more opportunities to score” ("Complete List (Offense)"). If the hitter reaches on a fielder’s error or a fielder’s choice, then the hitter is not credited with reaching base. OBP is a slightly better statistical measure than BA because OBP accounts for when the hitter is patient at the plate and can draw walks.

Table 5.2 Mean OBP and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

                      OVERALL             FIRST PITCH         NON-FIRST PITCH
Result of Pitch       Mean      SE        Mean      SE        Mean      SE
Ball                  0.3191    0.0001    0.3178    0.0002    0.3197    0.0001
Called Strike         0.3148    0.0002    0.3142    0.0002    0.3154    0.0002
Swinging Strike       0.3138    0.0001    0.3128    0.0003    0.3140    0.0001
Put in Play           0.3155    0.0001    0.3117    0.0004    0.3162    0.0002

Table 5.2 shows that a called ball on the first pitch yields the highest mean OBP. Putting the ball in play on the first pitch produces the lowest OBP for the hitter. A swinging strike has the second lowest OBP, which indicates that it is better not to swing on the first pitch. Looking at all pitches, a called ball has the highest OBP and a swinging strike has the lowest OBP.

Figure 5.2 Mean OBP by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

Figure 5.2 takes Table 5.2 and displays the mean OBP by result of pitch for all pitches, first pitches, and non-first pitches. The highest mean OBP, no matter the type of pitch, occurs on a ball.

Looking at overall pitches, the highest mean OBP occurs when the hitter takes a ball. The next highest overall OBP occurs when the pitch is put into play, followed by a called strike and lastly a swinging strike. The ordering from highest to lowest OBP makes sense because OBP rewards patience at the plate: by taking balls the hitter is more likely to draw walks, resulting in a higher OBP. Focusing on the first pitch, again the highest OBP occurs when the first pitch is a ball. A first pitch called strike leads to the second highest OBP, followed by a swinging strike and putting the ball in play on the first pitch. For OBP there is a benefit to taking the first pitch compared to swinging and possibly missing or putting the ball in play.

5.3 Slugging Percentage

$$\mathrm{SLG\%} = \frac{\mathrm{1B} + 2 \cdot \mathrm{2B} + 3 \cdot \mathrm{3B} + 4 \cdot \mathrm{HR}}{\mathrm{AB}}$$

Slugging percentage (SLG%) only looks at hits and does not include hit by pitches or walks like OBP does. “Slugging percentage represents the total number of bases a player records per at-bat” (“Glossary”). Slugging percentage takes into account that not all hits are the same. A home run is more valuable than a triple, a triple more valuable than a double, and so on. Therefore, this statistic is a measure of which hitters might be more likely to hit for power. The drawback to slugging percentage is that doubles are not actually twice as valuable as singles, and home runs are not four times as valuable as singles.

Table 5.3 Mean SLG% and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

                      OVERALL             FIRST PITCH         NON-FIRST PITCH
Result of Pitch       Mean      SE        Mean      SE        Mean      SE
Ball                  0.4059    0.0002    0.4045    0.0003    0.4064    0.0002
Called Strike         0.3968    0.0003    0.3975    0.0004    0.3960    0.0004
Swinging Strike       0.4026    0.0002    0.4032    0.0005    0.4025    0.0002
Put in Play           0.4013    0.0002    0.3969    0.0007    0.4022    0.0003

Table 5.3 looks at the slugging percentage mean and standard error for all pitches, first pitches, and non-first pitches. Overall, for all pitches the highest mean slugging percentage occurs when a ball is thrown and the lowest mean is when the pitch is a called strike. For the first pitch, a ball has the highest slugging percentage for the hitter and putting the ball in play has the lowest slugging percentage.

Figure 5.3 Mean SLG% by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

Figure 5.3 takes Table 5.3 and displays the mean SLG% by result of pitch for all pitches, first pitches, and non-first pitches. Whether the pitch is a first pitch, non-first pitch, or all pitches, the highest mean SLG% occurs when the pitch is a ball.

Looking at the overall SLG% for all pitches, a called ball has the highest mean SLG%. A swinging strike is second, putting the ball in play is third, and a called strike is the lowest. Overall, a swinging strike is better than a called strike for SLG%. Now turning to the first pitch, the highest SLG% is still the called ball. The second highest SLG% for the first pitch is a swinging strike, with a called strike third and putting the ball in play fourth. If the first pitch is going to be a strike, then it is better for the hitter’s SLG% if they swing at the first pitch than if they take a first pitch called strike.

5.4 On-Base Plus Slugging

$$\mathrm{OPS} = \mathrm{OBP} + \mathrm{SLG\%}$$

Hitters have two main goals: getting on base and hitting for power. On-base Plus Slugging (OPS) measures exactly these two goals. As its name suggests, it is the player’s on-base percentage and slugging percentage added together. On-base Plus Slugging is an overall measure of hitting effectiveness combining “how well a hitter can reach base with how well he can hit for average and for power” ("Glossary").
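The four hitting statistics defined in Sections 5.1 through 5.4 can be computed directly from a hitter's counting statistics. The sketch below follows the formulas given in those sections; the class name `HittingLine`, its field names, and the season totals are hypothetical and serve only as a worked example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HittingLine:
    """Season counting statistics for one hitter (field names are assumptions)."""
    ab: int       # at-bats
    h: int        # hits (singles + doubles + triples + home runs)
    doubles: int
    triples: int
    hr: int
    bb: int       # walks
    hbp: int      # hit by pitches
    sf: int       # sacrifice flies

    @property
    def singles(self) -> int:
        return self.h - self.doubles - self.triples - self.hr

    @property
    def ba(self) -> float:
        # BA = H / AB
        return self.h / self.ab

    @property
    def obp(self) -> float:
        # OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
        return (self.h + self.bb + self.hbp) / (self.ab + self.bb + self.hbp + self.sf)

    @property
    def slg(self) -> float:
        # SLG% = (1B + 2*2B + 3*3B + 4*HR) / AB
        total_bases = self.singles + 2 * self.doubles + 3 * self.triples + 4 * self.hr
        return total_bases / self.ab

    @property
    def ops(self) -> float:
        # OPS = OBP + SLG%
        return self.obp + self.slg


# Hypothetical season line, not a real player.
line = HittingLine(ab=500, h=150, doubles=30, triples=5, hr=20, bb=50, hbp=5, sf=5)
print(round(line.ba, 3), round(line.obp, 3), round(line.slg, 3), round(line.ops, 3))
```

For this made-up line the hitter has 95 singles and 250 total bases, so the slugging percentage is 250/500 and the on-base percentage is 205/560.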

Table 5.4 Mean OPS and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

                      OVERALL             FIRST PITCH         NON-FIRST PITCH
Result of Pitch       Mean      SE        Mean      SE        Mean      SE
Ball                  0.7249    0.0003    0.7223    0.0005    0.7260    0.0003
Called Strike         0.7115    0.0004    0.7116    0.0006    0.7114    0.0006
Swinging Strike       0.7164    0.0003    0.7160    0.0008    0.7165    0.0003
Put in Play           0.7168    0.0004    0.7085    0.0010    0.7183    0.0004

Table 5.4 combines OBP and SLG% to create OPS. The highest OPS for all pitches occurs when the pitch is a ball and the lowest OPS is a called strike. For the first pitch, a called ball is also the highest OPS and the lowest occurs when the ball is put into play.

Figure 5.4 Mean OPS by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

Figure 5.4 takes Table 5.4 and displays the mean OPS by result of pitch for all pitches, first pitches, and non-first pitches. The lowest OPS for the first pitches occurs when the ball is put in play. For all three types of pitches, the highest mean OPS occurs when the pitch is a ball.

A called ball has an overall mean OPS that is 0.0081 higher than a ball put into play, 0.0085 higher than a swinging strike, and 0.0134 higher than a called strike. Overall, a swinging strike has a slightly higher mean OPS than a called strike. Therefore, a pitch that is a ball leads to the highest mean OPS. If the pitch is going to be a strike, the hitter has a higher OPS if they swing compared to taking a called strike. Now turning to the first pitch, there are similar findings. A called ball has the highest mean OPS for the first pitch, followed by a swinging strike, a called strike, and a ball put into play. Thus, for the first pitch, putting the ball in play leads to the lowest mean OPS. Again, a swinging strike has a slightly higher mean OPS (by 0.0044) than a called strike.
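The differences quoted in this paragraph follow directly from the means in Table 5.4. As a quick arithmetic check, the sketch below recomputes them; the dictionary names and the helper `gaps_from` are illustrative only.

```python
# Mean OPS by result of pitch, copied from Table 5.4.
overall_ops = {
    "Ball": 0.7249,
    "Called Strike": 0.7115,
    "Swinging Strike": 0.7164,
    "Put in Play": 0.7168,
}
first_pitch_ops = {
    "Ball": 0.7223,
    "Called Strike": 0.7116,
    "Swinging Strike": 0.7160,
    "Put in Play": 0.7085,
}


def gaps_from(reference, means):
    """Difference between the reference outcome's mean and every other mean."""
    return {
        outcome: round(means[reference] - value, 4)
        for outcome, value in means.items()
        if outcome != reference
    }


overall_gaps = gaps_from("Ball", overall_ops)
first_pitch_swing_gap = round(
    first_pitch_ops["Swinging Strike"] - first_pitch_ops["Called Strike"], 4
)
print(overall_gaps)
print(first_pitch_swing_gap)
```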

SECTION 6: FIRST PITCH-PITCHER PERSPECTIVE

Previously Section 5 explored hitting statistics by outcome of the pitch. That section examined whether it was beneficial for the hitter to swing or not in general and on the first pitch.

Section 6 focuses on the outcome of the pitch for various pitching statistics and whether it is beneficial or not for the pitcher if the pitch outcome is a called ball, called strike, swinging strike, or a ball put into play. Pitchers are usually looking to get ahead of the hitter. They want the count in their favor. They want to be in control and have some leeway in the types of pitches they can throw as well as their location. Throwing a first pitch strike makes the count a pitcher’s count, whereas throwing a first pitch ball puts the count in the hitter’s favor. Is it better for pitchers to get the hitter to swing on the first pitch or not? Several pitching measures (BABIP, FIP, LOB%, WHIP, and ERA) are considered to help get a picture of what is most beneficial for the pitcher. These statistics are widely used by Sabermetricians to evaluate pitchers. Each statistic has a unique purpose and has both pros and cons when used to evaluate a pitcher. Combined, these statistics can create a clear picture of how the result of the pitch relates to mean pitching statistics.

6.1 Batting Average on Balls In Play

$$\mathrm{BABIP} = \frac{\mathrm{H} - \mathrm{HR}}{\mathrm{AB} - \mathrm{K} - \mathrm{HR} + \mathrm{SF}}$$

Batting Average on Balls In Play (BABIP) measures the fraction of balls put into play against a pitcher that fall for hits. The lower the BABIP, the better. If the plate appearance does not end in a home run, sacrifice hit, catcher’s interference, , walk, or strikeout, then the ball is said to be “put in play.” In other words, the defense has a chance of making a play on the ball. Strikeouts, hit by pitches, walks, and home runs are all generally controlled by the pitcher. Anytime the ball is put into play, the pitcher has little effect on the outcome of the play. “BABIP is likely even more important when evaluating pitchers because they have almost no control over what happens to a ball once it is put in play. A pitcher can control their strikeouts, walks, and home runs, and through those, the number of balls they allow to be put into play, but once the ball leaves the bat, it’s out of their hands” ("Complete List (Pitching)”). Therefore, several factors (defense, talent, and luck) can lead to a higher or lower BABIP for the pitcher. If the pitcher has a strong defense behind them, then there is a higher chance that fewer hits will be allowed. If the ball is hit hard, then it is very difficult for the defense to make a play on the ball, resulting in a hit.

Thus, the harder the ball is hit, the more likely the hitter is to reach base, and the pitcher’s BABIP becomes slightly higher. Lastly, luck plays some role in BABIP. “Batters and pitchers do not have complete control over where a ball lands so even high quality contact can turn into outs and low quality contact can turn into hits” ("Complete List (Pitching)”). These three factors can influence the pitcher’s BABIP even though they are not controlled by the pitcher. BABIP is most effective when at least 2,000 balls have been put in play against the pitcher ("Complete List (Pitching)”).
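As a worked example of the BABIP formula at the start of this section, the following sketch computes BABIP from counting statistics allowed by a pitcher. The function name `babip` and the season totals are hypothetical.

```python
def babip(h: int, hr: int, ab: int, k: int, sf: int) -> float:
    """BABIP = (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = ab - k - hr + sf
    if balls_in_play <= 0:
        raise ValueError("no balls in play, BABIP is undefined")
    return (h - hr) / balls_in_play


# Hypothetical totals allowed by a pitcher over a season.
example_babip = babip(h=200, hr=20, ab=800, k=180, sf=6)
print(round(example_babip, 4))
```

Here 606 balls were put in play (800 - 180 - 20 + 6) and 180 of them fell for hits, which is well short of the 2,000 balls in play mentioned above as the point where BABIP becomes most effective.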

Table 6.1 Mean BABIP and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

                      OVERALL             FIRST PITCH         NON-FIRST PITCH
Result of Pitch       Mean      SE        Mean      SE        Mean      SE
Ball                  0.2947    0.0001    0.2948    0.0002    0.2947    0.0001
Called Strike         0.2945    0.0001    0.2947    0.0002    0.2943    0.0002
Swinging Strike       0.2939    0.0001    0.2943    0.0003    0.2938    0.0001
Put in Play           0.2950    0.0001    0.2952    0.0003    0.2950    0.0001

In Table 6.1 for all pitches, a swinging strike produces the lowest BABIP for the pitcher. The highest BABIP for the pitcher occurs when the pitch is put in play. The same is true for the first pitch. A called strike and swinging strike are the best outcomes for the pitcher’s BABIP.

Figure 6.1 Mean BABIP by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

Figure 6.1 takes Table 6.1 and displays the mean BABIP by result of pitch for all pitches, first pitches, and non-first pitches. The lowest mean BABIP for all three types of pitches occurs when the pitch results in a swinging strike. The highest BABIP for the first pitch is when the pitch is put in play.

A swinging strike has the lowest overall mean BABIP for the pitcher. The second lowest overall BABIP is a called strike. A called ball has the third lowest mean BABIP, and a ball put in play gives the pitcher the highest BABIP. Overall, if the pitcher can get a strike, either swinging or called, then these outcomes result in the lowest mean BABIP. Looking at the first pitch leads to the same conclusions. A swinging strike has the lowest mean BABIP and the highest BABIP occurs when the ball is put in play. Therefore, if the pitcher can get a strike on the first pitch, they are better off in terms of mean BABIP.

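6.2 Left On-Base Percentage

$$\mathrm{LOB\%} = \frac{\mathrm{H} + \mathrm{BB} + \mathrm{HBP} - \mathrm{R}}{\mathrm{H} + \mathrm{BB} + \mathrm{HBP} - (1.4 \times \mathrm{HR})}$$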

Left On-Base Percentage (LOB%) is the percentage of baserunners who did not score out of the total number of baserunners. It is also referred to as the strand rate. League averages are about 70%, which means that on average pitchers are able to keep 70% of baserunners from scoring. A key component to having a high LOB% is pitching well when runners are on base ("Complete List (Pitching)”). Some pitchers might perform well with no runners on but not perform to the same level as soon as runners are on base. LOB% can help to identify those pitchers.
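The LOB% formula at the start of this section can be illustrated with a short sketch. The function name `lob_percentage` and the totals below are hypothetical; the result is returned on the percent scale used in Table 6.2.

```python
def lob_percentage(h: int, bb: int, hbp: int, r: int, hr: int) -> float:
    """LOB% = (H + BB + HBP - R) / (H + BB + HBP - 1.4*HR), as a percentage."""
    baserunners = h + bb + hbp
    return 100 * (baserunners - r) / (baserunners - 1.4 * hr)


# Hypothetical season totals allowed by a pitcher.
example_lob = lob_percentage(h=180, bb=50, hbp=5, r=80, hr=20)
print(round(example_lob, 2))
```

With these made-up totals the pitcher allowed 235 baserunners and 80 runs, giving a strand rate of roughly 75%, a little above the league average of about 70% quoted above.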

Table 6.2 Mean LOB% and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

                      OVERALL             FIRST PITCH         NON-FIRST PITCH
Result of Pitch       Mean      SE        Mean      SE        Mean      SE
Ball                  73.6198   0.0193    73.5828   0.0360    73.6346   0.0228
Called Strike         73.7190   0.0282    73.6311   0.0408    73.8050   0.0389
Swinging Strike       73.9199   0.0215    73.7976   0.0515    73.9429   0.0237
Put in Play           73.4610   0.0262    73.4697   0.0627    73.4593   0.0288

Table 6.2 shows that the overall highest LOB% occurs when the hitter has a swinging strike and the lowest when the pitch is put into play. On the first pitch, the same outcome is present. A swinging strike and called strike result in the highest LOB%, so throwing a first pitch strike is important for pitchers.

Figure 6.2 Mean LOB% by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

Figure 6.2 takes Table 6.2 and displays the mean LOB% by result of pitch for all pitches, first pitches, and non-first pitches. For all three types of pitches, the highest mean LOB% is when the pitch results in a swinging strike and the lowest is when the pitch is put in play.

The highest overall LOB% occurs when the outcome of the pitch is a swinging strike. The lowest overall LOB% for the pitcher occurs when the hitter puts the ball in play. If the pitcher can get a strike on the hitter, either through a swinging strike or a called strike, the pitcher has a higher mean LOB%. A similar observation occurs on the first pitch. A swinging strike and called strike have the highest LOB%, followed by a ball and then having the ball put in play. Putting the ball in play increases the chance that hitters reach base and that baserunners score, which is why it results in the lowest LOB% for the pitcher.

6.3 Walks plus Hits per Inning Pitched

$$\mathrm{WHIP} = \frac{\mathrm{BB} + \mathrm{H}}{\mathrm{IP}}$$

Walks plus Hits per Inning Pitched (WHIP) measures how well the pitcher has done his job of keeping baserunners off the bases. The lower the WHIP, the better the pitcher’s performance. An average WHIP for a pitcher is around 1.32. “WHIP does not consider the way in which a hitter reached base. (Obviously, home runs are more harmful to pitchers than walks.) Hit batsmen, errors and hitters who reach via fielder's choice do not count against a pitcher's WHIP” ("Glossary"). WHIP evaluates the pitcher’s performance without regard to errors or unearned runs. The few negatives of WHIP are: “If you want to measure base runners allowed using a rate stat, OBP against is a better choice because hitters faced is a better denominator than innings. WHIP is also lacking in that it treats all times on-base equally, equating a walk with a home run” ("Complete List (Pitching)").
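A short sketch of the WHIP calculation follows. One practical detail is an assumption about the data rather than something stated in this thesis: innings pitched are commonly recorded in box-score notation, where "6.1" means six and one-third innings and "6.2" means six and two-thirds. The helper names `innings_from_notation` and `whip` and the totals are hypothetical.

```python
def innings_from_notation(ip: str) -> float:
    """Convert box-score innings ("6.1" = 6 1/3, "6.2" = 6 2/3) to a number.

    Assumes the tenths digit counts outs recorded in a partial inning.
    """
    whole, _, outs = ip.partition(".")
    outs = int(outs) if outs else 0
    if outs not in (0, 1, 2):
        raise ValueError(f"unexpected partial-inning digit in {ip!r}")
    return int(whole) + outs / 3


def whip(bb: int, h: int, ip: float) -> float:
    """WHIP = (BB + H) / IP."""
    return (bb + h) / ip


# Hypothetical season totals.
season_ip = innings_from_notation("200.0")
example_whip = whip(bb=50, h=180, ip=season_ip)
print(round(example_whip, 2))
```

If the innings are already stored as decimal numbers, the conversion step is unnecessary and `whip` can be called directly.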

Table 6.3 Mean WHIP and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

                      OVERALL             FIRST PITCH         NON-FIRST PITCH
Result of Pitch       Mean      SE        Mean      SE        Mean      SE
Ball                  1.3204    0.0007    1.3213    0.0013    1.3201    0.0008
Called Strike         1.3059    0.0010    1.3062    0.0014    1.3056    0.0013
Swinging Strike       1.2981    0.0008    1.2983    0.0019    1.2981    0.0008
Put in Play           1.3166    0.0008    1.3142    0.0026    1.3171    0.0011

Table 6.3 looks at the mean WHIP by result of pitch for all pitches, first pitches, and non-first pitches. Overall, the lowest WHIP occurs on a swinging strike with a called strike as the second lowest overall WHIP. The highest overall WHIP is a called ball. For the first pitch, a swinging strike has the lowest mean WHIP and the highest is a called ball.

Figure 6.3 Mean WHIP by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

Figure 6.3 takes Table 6.3 and displays the mean WHIP by result of pitch for all pitches, first pitches, and non-first pitches. WHIP is very similar for all three types of pitches. The lowest mean WHIP is when the pitch results in a swinging strike and the highest occurs when the pitch is a ball.

The lowest mean overall WHIP occurs when the result of the pitch is a swinging strike. The second lowest is a called strike. Putting the ball in play has the third lowest mean WHIP and a ball results in the highest WHIP. Therefore, if pitchers keep throwing balls or have a lot of balls put in play, then they have a higher chance of allowing baserunners, which would increase their WHIP. The first pitch also shows that the lowest mean WHIP occurs on a swinging strike and the highest WHIP on a first pitch ball. This indicates that getting a first pitch strike is important for having a lower WHIP and keeping baserunners off the bases.

6.4 Earned Run Average

$$\mathrm{ERA} = \frac{\text{Earned Runs}}{\text{Innings Pitched}} \times 9$$

Earned run average (ERA) is the number of earned runs, that is, runs scored without the aid of an error, that a pitcher gives up per nine innings pitched. ERA is one of the most popular statistics used to evaluate pitchers. ERA does not adjust for the defense behind the pitcher. For example, the pitcher might have an extremely good defense on their team, so the number of earned runs is low. Since defense and even umpires or stadiums can influence a pitcher’s ERA, this statistic is not used to predict the future performance of the pitcher ("Complete List (Pitching)"). Another drawback of ERA is that if a pitcher leaves a game with runners on base, earned runs scored by those runners in the remainder of the inning count against the pitcher who was taken out. Therefore, ERA is predominantly used for measuring starting pitchers rather than relief pitchers ("Glossary").

Table 6.4 Mean ERA and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

                      OVERALL             FIRST PITCH         NON-FIRST PITCH
Result of Pitch       Mean      SE        Mean      SE        Mean      SE
Ball                  4.1200    0.0052    4.1318    0.0099    4.1152    0.0060
Called Strike         4.0667    0.0071    4.0841    0.0102    4.0497    0.0097
Swinging Strike       4.0234    0.0055    4.0271    0.0134    4.0227    0.0061
Put in Play           4.1384    0.0073    4.1424    0.0197    4.1377    0.0078

Table 6.4 explores the mean ERA for all pitches, first pitches, and non-first pitches by result of the pitch. Overall, the lowest ERA occurs when the result of the pitch is a swinging strike and the highest is when the ball is put in play by the hitter. The same conclusions can be made for the first pitch.

Figure 6.4 Mean ERA by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

Figure 6.4 takes Table 6.4 and displays the mean ERA by result of pitch for all pitches, first pitches, and non-first pitches. Similar to the other pitching statistics, the lowest mean ERA for all three types of pitches occurs for a swinging strike. However, the highest ERA is when the pitch is put in play on all three types of pitches.

The lower the ERA, the better the pitcher’s performance for the season. The lowest overall mean ERA occurs when the result of the pitch is a swinging strike. A called strike is the second lowest ERA, followed by a called ball and putting the ball in play. Therefore, when the hitter can put the ball in play, this results in the highest ERA for the pitcher. The first pitch has the same outcomes. Getting a first pitch strike, either a swinging strike or a called strike, results in the lowest ERA. Putting the ball in play has the highest ERA because when the hitter is able to make contact with the ball, there is a higher chance that they get on base and possibly score, which would result in at least one earned run.

6.5 Fielding Independent Pitching

$$\mathrm{FIP} = \frac{13 \cdot \mathrm{HR} + 3 \cdot (\mathrm{BB} + \mathrm{HBP}) - 2 \cdot \mathrm{K}}{\mathrm{IP}} + \text{FIP constant}$$

Another statistic that allows for the comparison of pitchers regardless of their defense or luck is Fielding Independent Pitching (FIP). FIP looks at home runs, walks, hit by pitches, and strikeouts. All of these measures are in the control of the pitcher and do not require the assistance of the defense. An average FIP is around 3.8. A lower FIP indicates that the pitcher is giving up fewer home runs, walks, and hit by pitches, and producing more strikeouts. The constant is roughly 3.1 and is used to bring the statistic onto the ERA scale. FIP is more reliable when looking at a whole season instead of individual games. More than a couple of innings are needed to determine the pitcher’s performance ("Complete List (Pitching)"). FIP is a more accurate measure of the pitcher’s own performance than ERA because FIP removes the role of the defense.
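Because the FIP constant exists to put FIP on the ERA scale, it can help to compute the two statistics side by side. The sketch below follows the formulas in Sections 6.4 and 6.5; the function names, the default constant of 3.1 (the approximate value mentioned above, which in practice varies by season), and the season totals are assumptions for illustration.

```python
def era(earned_runs: int, ip: float) -> float:
    """ERA = 9 * earned runs / innings pitched."""
    return 9 * earned_runs / ip


def fip(hr: int, bb: int, hbp: int, k: int, ip: float, constant: float = 3.1) -> float:
    """FIP = (13*HR + 3*(BB + HBP) - 2*K) / IP + FIP constant.

    The default constant is only the approximate value quoted in the text.
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical season totals for one pitcher.
example_era = era(earned_runs=80, ip=200.0)
example_fip = fip(hr=20, bb=50, hbp=5, k=180, ip=200.0)
print(round(example_era, 2), round(example_fip, 3))
```

In this made-up case the FIP is somewhat lower than the ERA, which under the interpretation given above would suggest that defense or luck contributed to some of the earned runs.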

Table 6.5 Mean FIP and Standard Error by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

                      OVERALL             FIRST PITCH         NON-FIRST PITCH
Result of Pitch       Mean      SE        Mean      SE        Mean      SE
Ball                  4.0678    0.0026    4.0765    0.0049    4.0643    0.0031
Called Strike         4.0208    0.0035    4.0271    0.0050    4.0147    0.0050
Swinging Strike       3.9806    0.0028    3.9758    0.0072    3.9815    0.0030
Put in Play           4.0720    0.0035    4.0573    0.0086    4.0747    0.0038

Table 6.5 illustrates that the lowest overall FIP is a swinging strike with a called strike following behind. Putting the ball in play and a called ball have the highest overall FIP for the pitcher. Looking at the first pitch, a swinging strike has the lowest mean FIP and a called ball has the highest mean FIP.

Figure 6.5 Mean FIP by Result of Pitch for First Pitches, Non-First Pitches, and All Pitches

Figure 6.5 takes Table 6.5 and displays the mean FIP by result of pitch for all pitches, first pitches, and non-first pitches. Swinging strikes resulted in the lowest mean FIP for all three types of pitches. For first pitches, the highest mean FIP happened when the pitch was a called ball. Put in play had the highest mean FIP for all pitches and non-first pitches.

Again, with this statistic, a swinging strike and a called strike overall lead to a lower mean FIP compared to a ball or the hitter putting the ball in play. Overall, the highest FIP occurs when the ball is put into play. For the first pitch, however, the highest mean FIP is when the first pitch is a ball. Therefore, if the pitcher can get a first pitch strike on the hitter, then this results in a lower mean FIP for the pitcher.

Tables 6.6 and 6.7 give a summary of the rankings of the means for each statistic. Table 6.6 focuses on the overall effects (all pitches) and Table 6.7 focuses on just the first pitches. A “1” represents the highest mean and a “4” the lowest mean. The statistics in bold (BABIP, WHIP, ERA, and FIP) are those for which it is better for the pitcher to have a lower mean. Therefore, for the statistics in bold, a “4” indicates that the result of the pitch is the best for the pitcher.

There is an advantage to the pitcher in getting the hitter to swing and miss or hit a foul ball, but the benefit disappears if the hitter makes contact and keeps the ball fair. Therefore, there is a risk in getting the hitter to swing on the first pitch. If the pitcher can get the hitter to produce a swinging strike, then the pitch is beneficial for the pitcher. However, if the hitter makes contact and keeps the ball fair, then this is usually the worst case for the pitcher. Putting the ball in play for all pitches was the worst outcome in four of the five pitching statistics (ERA, LOB%, FIP, and BABIP). A swinging strike for all pitches was the best outcome for all pitching statistics. Throwing a ball on the first pitch was the worst outcome in two statistics (FIP and WHIP), and putting the ball in play was the worst outcome in the other three (BABIP, LOB%, and ERA). A swinging strike on the first pitch gave the best outcome for all pitching statistics. All the pitching statistics showed that getting a first pitch strike is crucial for the pitcher.

For hitting, a first pitch ball resulted in the highest mean in all four of the hitting statistics previously examined. In three of the four hitting statistics (OBP, SLG%, and OPS), putting the ball in play on the first pitch was the worst outcome. The worst outcome for the hitter’s BA was a swinging strike on the first pitch. Looking at all pitches, the best outcome for the hitter in three of the four statistics (OBP, SLG%, and OPS) was a called ball. For BA, putting the ball in play resulted in the best outcome. A called strike was the worst outcome for all pitches in three of the four hitting statistics (BA, SLG%, and OPS). For OBP, the worst outcome for all pitches was a swinging strike.

Table 6.6 Summary of Rankings for Mean Hitting and Pitching Statistics Overall

                          Result of Pitch
            Statistic     Ball    Called Strike    Swinging Strike    Put in Play
Hitting     BA            2       4                3                  1
            OBP           1       3                4                  2
            SLG%          1       4                2                  3
            OPS           1       4                3                  2
Pitching    BABIP         2       3                4                  1
            LOB%          3       2                1                  4
            WHIP          1       3                4                  2
            ERA           2       3                4                  1
            FIP           2       3                4                  1

Table 6.7 Summary of Rankings for Mean Hitting and Pitching Statistics for First Pitches

                          Result of Pitch
            Statistic     Ball    Called Strike    Swinging Strike    Put in Play
Hitting     BA            1       2                4                  2
            OBP           1       2                3                  4
            SLG%          1       3                2                  4
            OPS           1       3                2                  4
Pitching    BABIP         2       3                4                  1
            LOB%          3       2                1                  4
            WHIP          1       3                4                  2
            ERA           2       3                4                  1
            FIP           1       3                4                  2

Tables 6.6 and 6.7 display the summary of rankings of mean hitting and pitching statistics for all pitches and first pitches. A “1” represents the highest mean and a “4” the lowest mean. The statistics in bold (BABIP, WHIP, ERA, and FIP) are those for which it is better for the pitcher to have a lower mean. Therefore, a “4” is the best for these statistics. All the pitching statistics showed that getting a first pitch strike is crucial for the pitcher. A called ball on the first pitch was the worst outcome for pitchers in two of the five pitching statistics, and the hitter putting the ball in play was the worst outcome in the other three. The best pitching outcomes for first pitches occurred on a swinging strike for all five pitching statistics. For hitting, a first pitch ball resulted in the highest mean in all four hitting statistics, and a swinging strike or putting the ball in play gave the lowest means.
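The rankings in Tables 6.6 and 6.7 can be reproduced from the means in the earlier tables. The sketch below ranks the four outcomes from highest mean (rank 1) to lowest, letting tied means share the better rank, which is one assumption that reproduces the tie shown for first-pitch BA. The function name `rank_descending` is illustrative; the means are copied from Tables 6.3 and 5.1.

```python
OUTCOMES = ("Ball", "Called Strike", "Swinging Strike", "Put in Play")


def rank_descending(means):
    """Rank 1 = highest mean; tied means share the better (smaller) rank."""
    return [1 + sum(other > m for other in means) for m in means]


# Overall mean WHIP (Table 6.3) and first-pitch mean BA (Table 5.1).
overall_whip = (1.3204, 1.3059, 1.2981, 1.3166)
first_pitch_ba = (0.2535, 0.2519, 0.2508, 0.2519)

whip_ranks = dict(zip(OUTCOMES, rank_descending(overall_whip)))
ba_ranks = dict(zip(OUTCOMES, rank_descending(first_pitch_ba)))
print(whip_ranks)
print(ba_ranks)
```

For WHIP, where a lower mean is better for the pitcher, the outcome ranked 4 (a swinging strike) is the most favorable one.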

SECTION 7: SWING VS NO SWING

As explored in the previous sections, the first pitch is crucial to the at-bat. But now why do hitters swing or not swing at the first pitch? Are there certain situations that make the hitters more likely to swing? “The first pitch is a complicated moment in any at-bat. The strike rate on the pitch is rising as pitchers pump it in the zone to get their advantage. And yet the swing rate on the pitch is diving as hitters try to be more patient in an era in which on-base percentage is praised” (Sarris).

About 29% of the time the hitter swings at the first pitch. But what are the factors that might change this number? Does the runner situation cause hitters to swing more often on the first pitch? Do certain innings or positions in the batting order result in higher swing rates on the first pitch? In this section, the effects of the count, the inning, runners on base, the hitter’s position in the batting order, and the location of the pitch on swing rates are explored. In 2015, Jeff Sullivan wrote a blog post for the website FanGraphs titled “Return of the First-Pitch Swing”. In this post, he explores the rise of the first-pitch swing and offers his explanation for it:

    To simplify everything: Hitters are behaving more aggressively. They had been more and more willing to let pitchers get ahead 0-and-1, but this year hitters have fought back, as if they finally noticed the trend. Pitchers are throwing better stuff than ever. They’re getting more strikeouts than ever. Pitchers these days, when ahead, are incredibly difficult to hit. Hitters have increasingly tried to get to them early, before they get in front. It’s a theory I remember Cash talking about in the article I can’t find. Hitters should go up there ready to hit, because the first pitch might well be the best pitch to hit that they see. Patience for patience’s sake is arguably and probably outdated. (Sullivan)

Throughout the following two sections, many factors are explored to determine if they contribute to the increase in swing rates. The first, Section 7.1, uses logistic regression models, and Section 7.2 focuses on a more sophisticated class of models called generalized additive models.

7.1 Logistic Regression Modeling

The overall goal of logistic regression modeling is to explain the relationship between a binary response variable and a number of independent variables. The response is binary which means that it only has two values. For example, binary variables might be male/female, yes/no, etc. They are usually coded “1” if the characteristic is present or “0” if not. In the male/female example, males might be coded as “1” and females as “0”. Our binary response is whether the hitter swings (coded as “1”) or not (coded as “0”) on all pitches and on first pitches. The goal is to learn which situations or other variables might influence whether the hitter swings or not. The fit of a logistic model allows for the prediction of the outcome. One can obtain fitted probabilities of swinging given different scenarios like baserunners, inning or the count.

The simple form of logistic regression has the form:

logit p = β0 + β1x1

where p is the probability that the binary response is “1” and logit p = $\log\frac{p}{1-p}$. Using this form results in the expected log of the odds of the outcome being a “success.” This form makes it difficult to interpret the results. By taking the inverse of the logit function, the equation now becomes:

$$p = \frac{e^{\beta_0 + \beta_1 x_1}}{1 + e^{\beta_0 + \beta_1 x_1}}$$

which gives the probability of “success,” p, for a given value of the predictor. The simple logistic regression equation describes how one predictor variable influences the probability of success, or the probability of swinging in our case. Instead of just one predictor variable, x1, there could be multiple predictor variables. Having multiple predictor variables is known as multiple logistic regression and has the following equation:

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

Again, taking the inverse of the multiple logistic regression equation:

$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n}}$$

where p is the probability of swinging, x1, …, xn are the predictor variables, and β0, β1, …, βn are the unknown coefficients. The focus is on fitting the probability of swinging based on the predictor variables count, placement of runners, order in the batting lineup, pitch distance from the center of the strike zone, and inning. In the remaining sections, logistic regression models are used to investigate how each single predictor variable influences the probability of swinging.
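The inversion of the logit function used above can be spelled out step by step. In the sketch below, η is shorthand for the linear predictor; it is introduced here only for illustration and is not notation used elsewhere in this chapter.

```latex
\begin{align*}
\log\frac{p}{1-p} &= \eta, \qquad \eta = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \\
\frac{p}{1-p} &= e^{\eta} \\
p &= e^{\eta}\,(1 - p) \\
p\,\bigl(1 + e^{\eta}\bigr) &= e^{\eta} \\
p &= \frac{e^{\eta}}{1 + e^{\eta}}
\end{align*}
```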

7.1.1 Probability of Swinging Given the Count

Considered first is the probability of swinging given the count. The count plays a role in whether the hitter will swing or not. For example, the hitter is not likely to swing on a 3-0 count unless the pitch is in their or in other words a perfect pitch. If the count is 0-2, the hitter has to protect the plate and may have to swing at anything close to the strike zone to avoid a strikeout.

Since count is a categorical variable, eleven binary variables have to be created to determine the probability of swinging for each count. The logistic regression model for count becomes:

logit p = -0.901 + 0.823x01 + 0.948x02 + 0.598x10 + 1.097x11 + 1.348x12 + 0.589x20 + 1.307x21 + 1.546x22 - 1.557x30 + 1.161x31 + 1.934x32

where p is the probability of swinging, x01 is “1” if the count is 0-1 and “0” otherwise, x02 is “1” if the count is 0-2 and “0” otherwise, x10 is “1” if the count is 1-0 and “0” otherwise, x11 is “1” if the count is 1-1 and “0” otherwise, x12 is “1” if the count is 1-2 and “0” otherwise, x20 is “1” if the count is 2-0 and “0” otherwise, x21 is “1” if the count is 2-1 and “0” otherwise, x22 is “1” if the count is 2-2 and “0” otherwise, x30 is “1” if the count is 3-0 and “0” otherwise, x31 is “1” if the count is 3-1 and “0” otherwise, and x32 is “1” if the count is 3-2 and “0” otherwise. If the count is 0-0, then all the variables would be “0” and the equation just becomes logit p = -0.901.

Just looking at the 3-2 count for example, the simple logistic regression model would be: logit p = -0.901 + 1.934x32 and since the count is 3-2, x32 is “1” which results in logit p = -0.901 + 1.934*1. Taking the inverse of this equation:

$$p = \frac{e^{-0.901 + 1.934 \cdot 1}}{1 + e^{-0.901 + 1.934 \cdot 1}} = \frac{e^{1.033}}{1 + e^{1.033}} \approx 0.738$$

Therefore the probability the hitter swings on a count of 3-2 is 0.738.
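The same calculation can be repeated for every count. The following is a minimal sketch in Python, assuming the rounded coefficients printed in the count model above; the function and variable names are hypothetical and are not taken from the analysis code behind this thesis. Because the coefficients are rounded to three decimals, the fitted probabilities may differ from the reported values in the third decimal place.

```python
import math

# Coefficients copied from the count model above; the 0-0 count is the baseline.
INTERCEPT = -0.901
COUNT_COEFS = {
    "0-1": 0.823, "0-2": 0.948, "1-0": 0.598, "1-1": 1.097,
    "1-2": 1.348, "2-0": 0.589, "2-1": 1.307, "2-2": 1.546,
    "3-0": -1.557, "3-1": 1.161, "3-2": 1.934,
}


def inverse_logit(eta):
    """Map a log-odds value to a probability."""
    return math.exp(eta) / (1.0 + math.exp(eta))


def swing_probability(count):
    """Fitted swing probability for a count written as 'balls-strikes'."""
    # Only the indicator for the current count equals 1, so the model
    # reduces to the intercept plus a single coefficient (or just the
    # intercept for the 0-0 baseline).
    return inverse_logit(INTERCEPT + COUNT_COEFS.get(count, 0.0))


if __name__ == "__main__":
    for count in ["0-0"] + sorted(COUNT_COEFS):
        print(f"{count}: {swing_probability(count):.3f}")
```

Running the script prints one fitted probability per count, which can be compared with the swing rates plotted in Figure 7.1.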

Figure 7.1 shows the swing rate for each count along with whether the count favors the hitter, favors the pitcher, or if the count is neutral. Recall that the count favors the hitter when the count is 1-0, 2-0, 2-1, 3-0, 3-1, or 3-2 and favors the pitcher when the count is 0-2, 1-2, 2-2, or 0-1. The neutral counts are 0-0 (first pitch) or 1-1. From Figure 7.1, the 3-0 count and the 0-0 count have the lowest swing rates. On the other hand, the 3-2 and 2-2 counts have the highest swing rates: a hitter will swing on these counts over 60% of the time.


Figure 7.1 Probability of Swinging Given the Count and Count Type

Figure 7.1 describes the relationship between the count or count type and the probability of swinging. Hitters are less than 10% likely to swing when the count is 3-0. When the count is 3-2 or 2-2, hitters are more than 60% likely to swing.

7.1.2 Probability of Swinging Given the Placement of the Runners

The placement of the runners can have an impact on whether the hitter chooses to swing or not. There are eight possible runner situations that can occur and they are: no runners on, runner on 1st base, runner on 2nd base, runner on 3rd base, runners on 1st and 2nd, runners on 1st and 3rd, runners on 2nd and 3rd, runners on 1st, 2nd, and 3rd (bases loaded). If no one is on-base, then the hitter might be more patient at the plate and see more pitches. With a runner on 1st base, the hitter could hit the ball anywhere so they don’t have to be as selective in the pitch that they hit. Whereas with a runner on second with less than two outs, the hitter is trying to get a pitch that they can hit up the middle or to the right side so that the runner can move and hopefully score. Therefore, the hitter is more selective early in the count with runners on second or third. Considered first are all pitches. By looking at all pitches, a baseline is created for the probabilities of swinging given runners. After this baseline is established, then just the first pitch is explored and the results can be compared to all pitches.


The logistic regression model for runners and all pitches becomes:

logit p = -0.073 + 0.024x1B2B + 0.086x1B3B - 0.026x2B - 0.031x2B3B - 0.012x3B + 0.129xloaded - 0.086xnone

where p is the probability of swinging for all pitches, x1B2B is “1” if runners are on 1st and 2nd base and “0” otherwise, x1B3B is “1” if runners are on 1st and 3rd base and “0” otherwise, x2B is “1” if a runner is on 2nd base only and “0” otherwise, x2B3B is “1” if runners are on 2nd and 3rd base and “0” otherwise, x3B is “1” if a runner is on 3rd base only and “0” otherwise, xloaded is “1” if bases are loaded (runners on 1st, 2nd, and 3rd base) and “0” otherwise, and xnone is “1” if there are no runners on-base and “0” otherwise. If there is a runner on 1st base only, then all of the variables would be “0” and the equation would just be logit p = -0.073. The equation for the first pitch follows the same format, but has different coefficient estimates:

logit p = -0.716 - 0.014x1B2B + 0.172x1B3B - 0.045x2B + 0.5x2B3B - 0.021x3B + 0.143xloaded - 0.347xnone

Figure 7.2 shows the probability of swinging with runners on-base for all pitches and first pitches. When runners are on 1st and 3rd base or bases are loaded, the probability of swinging is over 50% for all pitches. The red line in the right graph of Figure 7.2 indicates the average swing rate of 0.29 for first pitches. The highest swing rates on first pitches occur when runners are on 1st and 3rd base, bases are loaded, or runners are on 2nd and 3rd base. When any runners are on, the swing rate is higher than the average for first pitches. Meanwhile, with no runners on, the swing rate on first pitches is below the average of 0.29.


Figure 7.2 Probability of Swinging for All Pitches and First Pitches Given the Placement of the Runners

Figure 7.2 shows the probability of swinging overall and for first pitches given the runner situation. The lowest probability of swinging occurs when no one is on-base for all pitches and first pitches. Bases loaded has the highest probability that the hitter will swing for all pitches and the second highest probability for first pitches. For first pitches, having runners on 1st and 3rd base has the highest probability that the hitter swings. Situations with a runner on 1st base (1st only, 1st and 2nd, or 1st and 3rd) account for three of the four highest probabilities of swinging for all pitches.

7.1.3 Probability of Swinging Given the Position of the Hitter in the Batting Lineup

Does the position in the batting order affect whether the hitter swings or not? There are nine players that hit in the batting lineup and there is usually some strategy to where coaches place players in the lineup. The first person to hit is called the “lead-off” hitter. The player that is usually the fastest is placed in this spot. The lead-off hitter comes to the plate the most times and usually without runners on-base, so teams want the lead-off hitter to be able to get on-base as many times as possible so that the other players can get hits to move them. The 2nd person in the batting order is usually someone who can move the runners. They are usually one of the top three hitters in the lineup. The third spot is usually the second-best hitter. They are supposed to be able to drive in the lead-off runner. Another strategy is to put someone in the third spot who may not be one of the top three hitters because they usually come to bat with two outs and usually no runners on. The 4th spot, usually called the “cleanup,” goes to the top hitter. This person usually comes to bat at the most critical times and therefore should have the most power on the team. The number five hitter is usually the fourth best hitter, if the third position is not one of the top hitters. This hitter is usually close to abilities of the . The rest of the positions, six through nine, are filled in decreasing order, with the sixth spot usually going to the player most likely to steal bases and the seven through nine spots going to players who are likely to just hit singles from time to time (Kalkman). Sky Kalkman, in the article “Optimizing your Lineup by The Book,” summarizes by stating “the lineup spots rank in importance of avoiding outs: #1, #4, #2, #5, #3, #6, #7, #8, #9” (Kalkman).

The logistic regression model for batting lineup and all pitches becomes:

logit p = 0.521 - 0.001xb2 - 0.002xb3 + 0.01xb4 + 0.02xb5 + 0.022xb6 + 0.02xb7 + 0.016xb8 + 0.028xb9

where p is the probability of swinging, xb2 is “1” if the hitter is 2nd in the batting lineup and “0” otherwise, xb3 is “1” if the hitter is 3rd in the batting lineup and “0” otherwise, xb4 is “1” if the hitter is 4th in the batting lineup and “0” otherwise, xb5 is “1” if the hitter is 5th in the batting lineup and “0” otherwise, xb6 is “1” if the hitter is 6th in the batting lineup and “0” otherwise, xb7 is “1” if the hitter is 7th in the batting lineup and “0” otherwise, xb8 is “1” if the hitter is 8th in the batting lineup and “0” otherwise, and xb9 is “1” if the hitter is 9th in the batting lineup and “0” otherwise. If the hitter is hitting 1st in the lineup, then the equation becomes logit p = 0.521 because all of the other variables are “0”. The equation for the first pitch follows the same format, but has different coefficient estimates:

logit p = -1.04 + 0.02xb2 + 0.077xb3 + 0.155xb4 + 0.199xb5 + 0.215xb6 + 0.189xb7 + 0.169xb8 + 0.23xb9

Figure 7.3 shows the probability of swinging on all pitches and first pitches given the player’s position in the batting lineup. Looking at Figure 7.3, the players in spots 9, 6, and 5 have the highest rates of swinging at the first pitch and players in spots 9, 6, and 7 have the highest swing rates for all pitches. For all pitches, the player in spot three has the lowest probability of swinging, whereas for first pitches the lead-off hitter has the lowest. The red line on the right graph shows the average swing rate on the first pitch of 0.29. All of the hitters from spot 4 onward have swing rates that are higher than the average for first pitches. Hitters in spots 1, 2, and 3 have the lowest swing rates for all pitches and first pitches. This is likely because these players are usually more patient and wait for better pitches to hit.

Figure 7.3 Probability of Swinging for All Pitches and First Pitches Given the Position in the Batting Lineup

Figure 7.3 showcases the probability of swinging given the batting order for all pitches and first pitches. Hitters in position 1, 2, and 3 have the lowest probability of swinging for all pitches and first pitches. Hitters in positions 6 and 9 have the highest probability of swinging for all pitches and first pitches.

7.1.4 Probability of Swinging Given the Inning

Does inning influence how often hitters swing at the first pitch? Recall that there are nine innings in each baseball game, unless the score is tied at the end of nine innings, in which case more innings are played until one team outscores the other. If it is early in the game, then hitters might be more patient at the plate. If it is later in the game and their team is losing, then hitters might be more willing to swing at pitches. First considered is the probability of swinging for all pitches. Eight binary variables are created in the model. By looking at all pitches, a baseline is created. After this baseline is established, then just the first pitch is explored and the results can be compared to all pitches.

The logistic regression model for the inning and all pitches becomes:

logit p = -0.189 + 0.042x2 + 0.048x3 + 0.078x4 + 0.076x5 + 0.097x6 + 0.093x7 + 0.102x8 + 0.119x9

where p is the probability of swinging, x2 is “1” if it is the 2nd inning and “0” otherwise, x3 is “1” if it is the 3rd inning and “0” otherwise, x4 is “1” if it is the 4th inning and “0” otherwise, x5 is “1” if it is the 5th inning and “0” otherwise, x6 is “1” if it is the 6th inning and “0” otherwise, x7 is “1” if it is the 7th inning and “0” otherwise, x8 is “1” if it is the 8th inning and “0” otherwise, and x9 is “1” if it is the 9th inning and “0” otherwise. If it is the 1st inning, then all of the variables would be “0” and the equation would be logit p = -0.189. Looking at just the first pitch, the logistic regression equation becomes:

logit p = -1.13 + 0.109x2 + 0.153x3 + 0.31x4 + 0.278x5 + 0.333x6 + 0.28x7 + 0.284x8 + 0.292x9

Figure 7.4 shows the probability of swinging on all pitches and first pitches based on the inning. As the game progresses, the probability of swinging generally increases. The 9th inning has the highest swing rate for all pitches, at greater than 48%. The highest swing rate on the first pitch occurs in the 6th inning and the lowest is during the 1st inning. For first pitches, the first three innings have swing rates that are lower than the average swing rate of 0.29, while the other six innings are all higher than the average swing rate.
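These statements can be checked directly from the two inning equations. The sketch below assumes the rounded coefficients printed above and uses hypothetical names (`ALL_PITCHES`, `FIRST_PITCH`, `swing_table`); it illustrates the calculation and is not the code used to produce Figure 7.4.

```python
import math

# Coefficients copied from the two inning models above.
# The 1st inning is the baseline, so it has no coefficient of its own.
ALL_PITCHES = {
    "intercept": -0.189,
    "inning": {2: 0.042, 3: 0.048, 4: 0.078, 5: 0.076,
               6: 0.097, 7: 0.093, 8: 0.102, 9: 0.119},
}
FIRST_PITCH = {
    "intercept": -1.13,
    "inning": {2: 0.109, 3: 0.153, 4: 0.31, 5: 0.278,
               6: 0.333, 7: 0.28, 8: 0.284, 9: 0.292},
}


def swing_probability(model, inning):
    """Fitted swing probability in a given inning (1 through 9)."""
    eta = model["intercept"] + model["inning"].get(inning, 0.0)
    return 1.0 / (1.0 + math.exp(-eta))


def swing_table(model):
    """Fitted swing probability for each of the nine innings."""
    return {inning: swing_probability(model, inning) for inning in range(1, 10)}


if __name__ == "__main__":
    for label, model in [("all pitches", ALL_PITCHES), ("first pitch", FIRST_PITCH)]:
        print(label)
        for inning, p in swing_table(model).items():
            print(f"  inning {inning}: {p:.3f}")
```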

In the beginning of the game, the hitters are more likely to be patient and not swing at the first pitch. This is usually the first time the hitter is seeing the pitcher, and the hitter may want to take a pitch or two to get a feel for the pitcher. Also, taking pitches early in the game might cause the starting pitcher to throw more pitches than expected and be pulled from the game sooner. Later in the game, there might be a new sense of urgency to get runs in, either to pull away from the other team or to try and catch up. Therefore, hitters might be more likely to swing at the first pitch to try and advance runners, or they might have seen the starting pitcher a few times by then and be ready to swing at the first pitch.

Figure 7.4 Probability of Swinging for All Pitches and First Pitches Given the Inning

Figure 7.4 highlights the probability of swinging in general and for first pitches given the inning. Overall, hitters are more likely to swing at pitches as the innings increase. The 6th inning has the highest probability of swinging on the first pitch, whereas the 9th inning has the highest probability for all pitches. Innings 1, 2, and 3 have the lowest swing probabilities for all pitches and first pitches.

7.1.5 Logistic Model for All Players

Now that there is a general understanding of the variables that might affect the swing rate, a logistic model for all players will be created. Seven total variables are considered in the full multiple logistic regression model. A multiple logistic regression model allows for the determination of which combination of variables is significant in predicting the probability that the hitter will swing. For example, the model may suggest that inning and order in the batting lineup are not important in deciding whether the hitter is likely to swing or not.

The first variable considered in the model was whether runners were on-base or not. Based on Figure 7.2, the lowest probability of swinging occurred with no runners on for all pitches and first pitches. This variable was coded “1” if there were any runners on-base and “0” if no runners were on-base. Figure 7.4 showed that there might be an early inning effect on the swing rate. The first three innings had the smallest probability of swinging for all pitches and first pitches. Therefore, the second variable included in the full model was whether the inning was one of the first three innings or not (coded “1” if the inning was 1, 2, or 3 and coded “0” otherwise). The pitch location is likely to be significant in predicting whether the hitter will swing or not. If the pitch is out of the strike zone, then the hitter is probably less likely to swing. Therefore, the third variable considered was the distance (in feet) from the center of the strike zone for each pitch. The count has an important role as to whether the hitter will swing or not. Thus, to consider the count type, two binary variables (“Favors Pitcher” and “Neutral”) were created. “Favors Pitcher” indicates whether the count favored the pitcher (coded “1” if the count is 0-2, 1-2, 2-2, or 0-1 and “0” if not), and “Neutral” indicates whether the count was neutral (coded “1” if the count is 0-0 or 1-1 and “0” if not). If both the “Favors Pitcher” and “Neutral” variables were “0”, then the count favored the hitter.

Another factor not previously mentioned that might affect swing rate is the quality of the pitcher. If the pitcher is an average or above-average pitcher, then the hitter might be more likely to swing. Pitchers are considered below average if their ERA is greater than 3.75. The pitcher quality variable was coded “1” if the pitcher’s ERA was equal to or below 3.75, meaning that the pitcher was an average or above-average pitcher, and “0” if the pitcher’s ERA was greater than 3.75. Lastly, the number of times up to bat might affect the swing rate. If the hitter is coming to the plate for the first time, then they might be more patient at the plate because they want to see more pitches to try and get a feel for the pitcher’s tendencies, whereas if it is later in the game and the hitter has already seen the pitcher two or more times, then the hitter might come to the plate more likely to swing earlier in the count. The variable called “First PA” is coded “1” if the plate appearance is the first for the hitter and “0” otherwise.

A total of seven variables (“Runner On”, “Early Inning”, “Favors Pitcher”, “Neutral”, “Distance”, “Pitcher Quality”, and “First PA”) are used to fit a multiple logistic model. The coefficient estimates and p-values from the full model help to determine the variables that are significant in predicting swing rate.
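The coding rules for the seven variables can be summarized in a short function. This is a minimal sketch: the function name and argument names (`encode_pitch`, `runners_on_base`, `plate_appearance`, and so on) are hypothetical and do not correspond to column names in the underlying pitch data; only the coding rules themselves come from the text above.

```python
def encode_pitch(balls, strikes, inning, runners_on_base,
                 distance_ft, pitcher_era, plate_appearance):
    """Return the seven predictors for a single pitch as a dictionary.

    runners_on_base is the number of runners on base (0 to 3),
    distance_ft is the distance in feet from the center of the strike zone,
    plate_appearance is 1 for the hitter's first trip to the plate, 2 for
    the second, and so on.
    """
    count = (balls, strikes)
    favors_pitcher = count in {(0, 2), (1, 2), (2, 2), (0, 1)}
    neutral = count in {(0, 0), (1, 1)}
    return {
        "Runner On": int(runners_on_base > 0),
        "Early Inning": int(inning <= 3),
        "Distance": float(distance_ft),
        "Favors Pitcher": int(favors_pitcher),
        # If both count indicators are 0, the count favors the hitter.
        "Neutral": int(neutral),
        "Pitcher Quality": int(pitcher_era <= 3.75),
        "First PA": int(plate_appearance == 1),
    }
```

For example, `encode_pitch(0, 0, 1, 0, 0.8, 3.20, 1)` describes the first pitch of a hitter’s first plate appearance in the 1st inning with the bases empty, a pitch 0.8 feet from the center of the zone, and a pitcher whose ERA is 3.20.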

Recall that the equation of a multiple logistic regression equation is:

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

where p is the probability of swinging, x1, …, xn are the predictor variables, and β0, β1, …, βn are the unknown coefficients. The method of maximum likelihood finds the coefficient estimates.
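To give a sense of what maximum likelihood estimation involves, the sketch below fits a logistic regression with a single predictor by Newton’s method. It is an illustration only: the data are synthetic (made up for this example, with a predictor that loosely plays the role of distance), the function names are hypothetical, and the models in this chapter were fit with statistical software rather than with this code.

```python
import math
import random


def fit_logistic(xs, ys, iterations=25):
    """Maximum likelihood fit of logit p = b0 + b1*x by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Accumulate the gradient of the log-likelihood (g0, g1) and the
        # entries of the 2x2 information matrix (h00, h01, h11).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 linear system H * step = gradient.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def simulate(n=5000, b0=2.0, b1=-2.0, seed=1):
    """Synthetic swing/no-swing data generated from a known logistic model."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 3.0) for _ in range(n)]
    ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(b0 + b1 * x))) else 0
          for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = simulate()
    print(fit_logistic(xs, ys))  # estimates should be near the true (2.0, -2.0)
```

With enough simulated pitches, the estimates land close to the coefficients used to generate the data, which is the behavior one expects from a maximum likelihood fit.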

How should one interpret the coefficient estimates? The coefficient estimate is the estimated increase in the log odds for each unit increase of the predictor variable, given that all of the other predictor variables in the model are held constant. For example, suppose the multiple logistic regression equation was $\log\frac{p}{1-p} = 0.32 + 1.75x_1 - 4.14x_2$. For each unit increase of x1, there is an estimated increase of 1.75 in the log odds given that x2 remains constant. Similarly, for every one unit increase of x2, the log odds decrease by 4.14, holding x1 constant.
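A short derivation with the illustrative equation above shows why the coefficient equals the change in log odds, and what that change means on the odds scale:

```latex
\begin{align*}
\log\frac{p}{1-p}\bigg|_{x_1 + 1} - \log\frac{p}{1-p}\bigg|_{x_1}
  &= \bigl[0.32 + 1.75(x_1 + 1) - 4.14x_2\bigr] - \bigl[0.32 + 1.75x_1 - 4.14x_2\bigr] \\
  &= 1.75, \\
\frac{\text{odds at } x_1 + 1}{\text{odds at } x_1}
  &= e^{1.75} \approx 5.75 .
\end{align*}
```

In other words, a one-unit increase of x1 multiplies the odds of “success” by roughly 5.75 when x2 is held constant.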

How does one tell if the predictor variables are significant in predicting the response? In logistic regression, each variable is tested to determine if the variable is making a significant contribution to the overall model. A low p-value means that the variable is significant in the model or, in other words, changes to the predictor are associated with changes in the response. Typically a p-value that is less than 0.05 is considered significant. The logistic model for all players is used to show an example of looking at the full model to determine the reduced model. The reduced model is the model with only the significant variables. Table 7.1 shows the full model of all players and all pitches. The equation for the full logistic regression model is:

logit p = 1.979 + 0.231(Runner On) - 0.112(Early Inning) - 1.966(Distance) + 0.691(Favors Pitcher) - 0.654(Neutral) + 0.045(Pitcher Quality) - 0.025(First PA)

where p is the probability of swinging. From Table 7.1, all of the variables have p-values that are less than 0.05 except for “First PA”. Therefore, all of the variables are significant in predicting whether the hitters swing or not, except for first plate appearance.

Table 7.1 Coefficient Estimates for Full Model for All Players and All Pitches

| | Coefficient Estimates | SE | P-value |
|---|---|---|---|
| Intercept | 1.979 | 0.009 | 0.000 |
| Runner On | 0.231 | 0.006 | 0.000 |
| Early Inning | -0.112 | 0.006 | 0.000 |
| Distance | -1.966 | 0.006 | 0.000 |
| Favors Pitcher | 0.691 | 0.008 | 0.000 |
| Neutral | -0.654 | 0.007 | 0.000 |
| Pitcher Quality | 0.045 | 0.006 | 0.000 |
| First PA | -0.025 | 0.023 | 0.283 |

Table 7.1 shows the coefficient estimates for a full logistic model for all players along with the associated standard error (SE), and p-value. Bolded terms mean that the variables are significant at a significance level of 0.05. All variables are significant except for “First PA”, which is “1” if this is the first plate appearance for the hitter and “0” otherwise.
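The p-values in the table can be approximated from the coefficient estimates and standard errors. The sketch below assumes a two-sided Wald z-test, which is what most statistical software reports for logistic regression coefficients; the text does not state which test was used, so this is an assumption, and the function name is hypothetical. Because the table entries are rounded, the result only approximately matches the reported p-value.

```python
import math


def wald_p_value(estimate, standard_error):
    """Two-sided p-value for H0: coefficient = 0, using a normal reference."""
    z = estimate / standard_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # "First PA" row of Table 7.1: estimate -0.025, SE 0.023.
    print(f"First PA:  {wald_p_value(-0.025, 0.023):.3f}")
    # "Runner On" row of Table 7.1: estimate 0.231, SE 0.006.
    print(f"Runner On: {wald_p_value(0.231, 0.006):.3g}")
```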


Table 7.2 Coefficient Estimates for Reduced Model for All Players and All Pitches

| | Coefficient Estimates | SE | P-value |
|---|---|---|---|
| Intercept | 1.979 | 0.009 | 0.000 |
| Runner On | 0.231 | 0.006 | 0.000 |
| Early Inning | -0.112 | 0.006 | 0.000 |
| Distance | -1.966 | 0.006 | 0.000 |
| Favors Pitcher | 0.691 | 0.008 | 0.000 |
| Neutral | -0.654 | 0.007 | 0.000 |
| Pitcher Quality | 0.045 | 0.006 | 0.000 |

Table 7.2 shows the coefficient estimates for the reduced logistic model for all players along with the associated standard error (SE), and p-value. Bolded terms mean that the variables are significant at a significance level of 0.05.

Since first plate appearance is the only variable that is not significant, the model can be trimmed to include all the variables except first plate appearance. Table 7.2 gives the reduced logistic regression model:

logit p = 1.979 + 0.231(Runner On) - 0.112(Early Inning) - 1.966(Distance) + 0.691(Favors Pitcher) - 0.654(Neutral) + 0.045(Pitcher Quality)

Runner On is “1” if there are any runners on-base and “0” otherwise. Early Inning is “1” if the inning is 1, 2, or 3 and “0” otherwise. Distance is the distance in feet from the center of the strike zone. Favors Pitcher is “1” if the count is 0-2, 1-2, 2-2, or 0-1 and “0” otherwise. Neutral Count is “1” if the count is 0-0 or 1-1 and “0” otherwise. If the count favors the hitter, then Favors Pitcher and Neutral Count are both “0.” Pitcher Quality is “1” if the pitcher’s ERA was equal to or below 3.75 and “0” otherwise. If the count favors the hitter, then the logistic regression model just becomes:

logit p = 1.979 + 0.231(Runner On) - 0.112(Early Inning) - 1.966(Distance) + 0.045(Pitcher Quality)

because “Favors Pitcher” and “Neutral Count” are both “0”.
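The reduced model can be turned into fitted swing probabilities for any scenario. The sketch below assumes the rounded coefficients from Table 7.2; the dictionary layout, the function name, and the example scenario are hypothetical and chosen only for illustration.

```python
import math

# Coefficients of the reduced model for all players and all pitches (Table 7.2).
REDUCED_MODEL = {
    "Intercept": 1.979,
    "Runner On": 0.231,
    "Early Inning": -0.112,
    "Distance": -1.966,
    "Favors Pitcher": 0.691,
    "Neutral": -0.654,
    "Pitcher Quality": 0.045,
}


def swing_probability(pitch, model=REDUCED_MODEL):
    """Fitted swing probability for a pitch given as {variable: value}.

    Variables that are missing from `pitch` are treated as 0.
    """
    eta = model["Intercept"] + sum(
        coef * pitch.get(name, 0.0)
        for name, coef in model.items()
        if name != "Intercept"
    )
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical scenario: first pitch (neutral count) in the 1st inning, bases
# empty, pitch one foot from the center of the zone, pitcher ERA above 3.75.
example = {"Neutral": 1, "Early Inning": 1, "Distance": 1.0}
print(round(swing_probability(example), 3))
```

Changing a single entry of `example`, such as setting `"Runner On"` to 1 or increasing `"Distance"`, shows how each variable moves the fitted probability while the others are held fixed.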

Instead of looking at just the regression coefficients, the odds ratio (OR) can be more helpful. “The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure” (Szumilas).

The odds ratios can be calculated by exponentiating the regression coefficients: $e^{\beta_1}, \dots, e^{\beta_n}$.

How does one find the odds ratio using the coefficient estimates? Let’s just look at the variable “Runner On”. To find the odds ratio for “Runner On”, consider the simple logistic equation for the runner situation: logit p = 1.979 + 0.231(Runner On). With runners on-base, the logistic equation becomes: logit p = 1.979 + 0.231*1 = 2.21. With runners not on-base, the logistic equation is logit p = 1.979 + 0.231*0 = 1.979. The difference in log odds is 2.21 - 1.979 = 0.231. To convert this to the change in odds, exponentiate the difference: $e^{0.231} \approx 1.26$. When runners are on-base, there is about a 26% increase in the odds that the hitter swings for all pitches compared to when runners are not on-base.

Table 7.3 shows the regression coefficients and odds ratios for the reduced logistic model for all players and all pitches. The odds of players swinging are 1.995 times higher when the count favors the pitcher, controlling for the other variables. When the count is neutral, the odds of swinging are roughly 48% (1-0.520) lower. For each one foot increase of distance from the center of the strike zone, the odds of the players swinging decrease by roughly a factor of 7 (1/0.140). When runners are on-base, the odds of the players swinging are 1.261 times higher than when runners are not on-base. The odds of swinging are 11% (1-0.893) lower when the inning is early compared to innings later in the game. When the pitcher’s ERA is 3.75 or below, the odds of swinging are about 5% higher compared to pitchers with higher ERAs.
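The odds ratios can be reproduced by exponentiating each coefficient. The sketch below assumes the rounded coefficients of the reduced model; the function names are hypothetical. Since the inputs are rounded to three decimals, the results can differ from the tabulated odds ratios in the third decimal place.

```python
import math

# Coefficients of the reduced model for all players and all pitches.
COEFFICIENTS = {
    "Runner On": 0.231,
    "Early Inning": -0.112,
    "Distance": -1.966,
    "Favors Pitcher": 0.691,
    "Neutral": -0.654,
    "Pitcher Quality": 0.045,
}


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase."""
    return math.exp(coefficient)


def percent_change_in_odds(coefficient):
    """Percent change in the odds for a one-unit increase."""
    return 100.0 * (math.exp(coefficient) - 1.0)


if __name__ == "__main__":
    for name, coef in COEFFICIENTS.items():
        print(f"{name:16s} OR = {odds_ratio(coef):.3f} "
              f"({percent_change_in_odds(coef):+.1f}% change in odds)")
```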


Table 7.3 Coefficient Estimates and Odds Ratio for Reduced Model for All Players and All Pitches

| | Coefficient Estimates | P-value | Odds Ratio |
|---|---|---|---|
| Intercept | 1.979 | 0.000 | |
| Runner On | 0.231 | 0.000 | 1.261 |
| Early Inning | -0.112 | 0.000 | 0.893 |
| Distance | -1.966 | 0.000 | 0.140 |
| Favors Pitcher | 0.691 | 0.000 | 1.995 |
| Neutral | -0.654 | 0.000 | 0.520 |
| Pitcher Quality | 0.045 | 0.000 | 1.047 |

Table 7.3 reveals that the odds of players swinging are 1.995 times higher when the count favors the pitcher, controlling for the other variables. When the count is neutral, the odds of swinging are roughly 48% (1-0.520) lower. For each one foot increase of distance from the center of the strike zone, the odds of the players swinging decrease by roughly a factor of 7 (1/0.140). When runners are on-base, the odds of the players swinging are 1.261 times higher than when runners are not on-base. The odds of swinging are 11% (1-0.893) lower when the inning is early compared to innings later in the game. When the pitcher’s ERA is 3.75 or below, the odds of swinging are about 5% higher compared to pitchers with higher ERAs.

Now let’s turn the attention to just the first pitch. All of the variables except first plate appearance were significant in predicting whether the hitter will swing or not for all pitches. Are these variables also significant for first pitches? Count type is not included in the full model because only the first pitch is considered here. The variables in the full model are “Runner On,” “Early Inning,” “Distance,” “Pitcher Quality,” and “First PA.” The coefficient estimates for first pitches for all players are shown in Table 7.4. All of the variables have p-values that are less than 0.05, which means that they are all significant in predicting whether the hitter swings or not.

The final logistic regression model for first pitches is:

logit p = 0.505 + 0.520(Runner On) - 0.254(Early Inning) - 1.615(Distance) + 0.056(Pitcher Quality) - 0.153(First PA)

Table 7.5 illustrates the odds ratio for the logistic regression model for all players and just first pitches. For each one foot increase of distance from the center of the strike zone, the odds of the players swinging decrease by roughly a factor of 5 (1/0.199). When runners are on-base, the odds of the players swinging are 1.681 times higher than when runners are not on-base. The odds of swinging are 22% (1-0.776) lower when the inning is early compared to innings later in the game. When the pitcher’s ERA is 3.75 or below, the odds of swinging are about 6% higher compared to pitchers with higher ERAs. Lastly, when the hitter is coming to the plate for the first time in the game, the odds of swinging are about 14% lower compared to other plate appearances.

Table 7.4 Coefficient Estimates for Full Model for All Players and First Pitches

| | Coefficient Estimates | SE | P-value |
|---|---|---|---|
| Intercept | 0.505 | 0.015 | 0.000 |
| Runner On | 0.520 | 0.012 | 0.000 |
| Early Inning | -0.254 | 0.013 | 0.000 |
| Distance | -1.615 | 0.013 | 0.000 |
| Pitcher Quality | 0.056 | 0.012 | 0.000 |
| First PA | -0.153 | 0.051 | 0.0024 |

Table 7.4 shows the coefficient estimates for the full logistic model for all players and first pitches. All of the variables are significant in predicting the probability the hitter swings.

Table 7.5 Coefficient Estimates and Odds Ratio for Reduced Model for All Players and First Pitches

| | Estimate | P-value | Odds Ratio |
|---|---|---|---|
| Intercept | 0.505 | 0.000 | |
| Runner On | 0.520 | 0.000 | 1.681 |
| Early Inning | -0.254 | 0.000 | 0.776 |
| Distance | -1.615 | 0.000 | 0.199 |
| Pitcher Quality | 0.056 | 0.000 | 1.057 |
| First PA | -0.153 | 0.0024 | 0.858 |

Table 7.5 indicates that for each one foot increase of distance from the center of the strike zone, the odds of the players swinging decrease by roughly a factor of 5 (1/0.199). When runners are on-base, the odds of the players swinging are 1.681 times higher than when runners are not on-base. The odds of swinging are 22% (1-0.776) lower when the inning is early compared to innings later in the game. When the pitcher’s ERA is 3.75 or below, the odds of swinging are about 6% higher compared to pitchers with higher ERAs. Lastly, when the hitter is coming to the plate for the first time in the game, the odds of swinging are about 14% lower compared to other plate appearances.


7.1.6 Logistic Models for Certain Players

Now that there is a general understanding of the variables that are significant in affecting the swing rate for all players on first pitches and all pitches, a logistic model is produced for only certain players. Each hitter has different tendencies. Therefore, creating multiple logistic regression models for each player will help determine which variables affect the swing rate for each player. Some players may be less likely to swing at pitches when there are no runners on- base. Players might be more willing to swing when it isn’t their first at-bat of the game. The inning might be a deciding factor for certain players. There are players that are defined as mature players and other players that are classified as “free swingers.” Both types of players are explored to help determine what variables might be important in helping to decide if the hitter will swing or not.

To accomplish this, a multiple logistic regression model is created for all pitches as well as for first pitches for each player. The four players chosen are Mike Trout, Victor Martínez, Salvador Pérez, and Adam Jones.

The first player examined is Mike Trout. Mike Trout is one of the current top hitters and exhibits extreme patience at the plate. Mike Trout was drafted by the Los Angeles Angels with the 25th pick in 2009 as a center-fielder. It wasn’t until 2012 that he made his debut to the major leagues. Trout ended up winning the AL Rookie of the Year in 2012 and received the AL MVP in 2014. Trout “became the first player in the 82-year history of the game to win back-to-back All-Star MVP awards” (Augustyn). In 2015, Trout had a career-high 41 home runs and led the AL with a 0.590 slugging percentage. Trout is considered one of the best all-around players in the 21st century (Augustyn). Mike Trout’s swing rate for all pitches is about 0.37.

Just like in the full logistic regression model for all players, seven total variables were considered for Mike Trout’s logistic regression model for all pitches. Table 7.6 shows the coefficient estimates for the full model for Mike Trout for all pitches. Only “Distance,” “Favors Pitcher,” and “Neutral” are significant in predicting the probability that Mike Trout swings for all pitches.

Table 7.6 Mike Trout’s Coefficient Estimates for Full Model for All Pitches

| Variable | Coefficient Estimate | SE | P-value |
|---|---|---|---|
| Intercept | 1.604 | 0.139 | 0.000 |
| Runner On | 0.049 | 0.095 | 0.6022 |
| Early Inning | -0.207 | 0.122 | 0.0891 |
| **Distance** | -1.770 | 0.096 | 0.000 |
| **Favors Pitcher** | 0.614 | 0.113 | 0.000 |
| **Neutral** | -1.191 | 0.120 | 0.000 |
| Pitcher Quality | 0.001 | 0.093 | 0.9897 |
| First PA | 0.116 | 0.131 | 0.3745 |

Table 7.6 shows the coefficient estimates for the full logistic model for Mike Trout along with the associated standard error (SE), and p-value. Bolded terms mean that the variables are significant at a significance level of 0.05. Only distance and count type are considered significant in predicting the probability of Trout swinging.

Table 7.7 shows the new coefficient estimates for Mike Trout’s reduced model for all pitches. The reduced logistic regression model for Trout is:

\[
\operatorname{logit}(p) = 1.578 - 1.772\,(\text{Distance}) + 0.613\,(\text{Favors Pitcher}) - 1.189\,(\text{Neutral Count})
\]

where p is the probability of swinging and Distance is the distance in feet from the center of the strike zone. Favors Pitcher is “1” if the count is 0-2, 1-2, 2-2, or 0-1 and “0” otherwise. Neutral Count is “1” if the count is 0-0 or 1-1 and “0” otherwise. If the count favors the hitter, then Favors Pitcher and Neutral Count are both “0.”

Table 7.8 shows the regression coefficients and odds ratios for Mike Trout’s logistic model. The odds of Trout swinging are 1.846 times as high when the count favors the pitcher as when it favors the hitter, controlling for distance. When the count is neutral, the odds of Trout swinging are roughly 70% (1-0.304) lower, controlling for distance. For each one-foot increase in distance from the center of the strike zone, the odds of Trout swinging decrease by roughly a factor of 6 (1/0.1699).

Table 7.7 Mike Trout’s Coefficient Estimates for Reduced Model for All Pitches

| Variable | Coefficient Estimate | SE | P-value |
|---|---|---|---|
| Intercept | 1.578 | 0.121 | 0.000 |
| Distance | -1.772 | 0.096 | 0.000 |
| Favors Pitcher | 0.613 | 0.113 | 0.000 |
| Neutral | -1.189 | 0.120 | 0.000 |

Table 7.7 shows the coefficient estimates for the reduced model for Mike Trout along with the associated standard error (SE) and p-value. Only “Distance”, “Favors Pitcher” and “Neutral” were significant in the full model, so these are the only variables included in the reduced model.

Table 7.8 Coefficient Estimates and Odds Ratio for Mike Trout’s Reduced Model for All Pitches

| Variable | Coefficient Estimate | P-value | Odds Ratio |
|---|---|---|---|
| Intercept | 1.578 | 0.000 | |
| Distance | -1.772 | 0.000 | 0.1699 |
| Favors Pitcher | 0.613 | 0.000 | 1.846 |
| Neutral Count | -1.189 | 0.000 | 0.304 |

Table 7.8 indicates that the odds of Trout swinging are 1.846 times as high when the count favors the pitcher, controlling for distance. When the count is neutral, the odds of Trout swinging are roughly 70% (1-0.304) lower, controlling for distance. For each one-foot increase in distance from the center of the strike zone, the odds of Trout swinging decrease by roughly a factor of 6 (1/0.1699).
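As a worked check, each odds ratio reported for Trout is the exponential of the corresponding coefficient estimate in Table 7.7. The values below are computed from the three-decimal coefficients as printed, so the last digit can differ slightly from the tabled odds ratios, which were presumably computed from unrounded estimates.

```latex
\begin{align*}
\text{OR}_{\text{Favors Pitcher}} &= e^{0.613} \approx 1.846 \\
\text{OR}_{\text{Neutral Count}} &= e^{-1.189} \approx 0.305, \qquad 1 - 0.305 \approx 0.70 \\
\text{OR}_{\text{Distance}} &= e^{-1.772} \approx 0.170, \qquad 1 / 0.170 \approx 5.9
\end{align*}
```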

How does Mike Trout’s probability of swinging change with the distance from the center of the strike zone and the count type? Figure 7.5 graphs the probability of swinging on the y-axis against the distance of the pitch from the center of the strike zone for each count type. Mike Trout has the highest probability of swinging when the count favors the pitcher. When the count favors the pitcher, most hitters are more likely to swing because they don’t want to fall even farther behind in the count or strike out. When the count favors the hitter, the hitter can be more selective about which pitches to swing at. Also, as the distance of the pitch from the center of the strike zone increases, the probability of swinging decreases. When the pitch is perfect (distance = 0) and the count is neutral, Trout’s probability of swinging is only about 0.57.

The pitch would still be considered a strike up to one foot from the center of the strike zone. Anything greater than one foot from the center of the strike zone would not be a strike. Thus, when the pitch is about one and a half feet from the center of the strike zone, Trout has a probability of swinging of about 13% when the count is neutral, 25% when the count favors Trout, and about 38% when the count favors the pitcher. These numbers are fairly low, which again shows that Trout is a selective hitter who mostly swings at strikes. When the pitch is two feet from the center of the strike zone (no longer a strike), Trout is less than 25% likely to swing at the pitch regardless of the count type.
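The probabilities discussed above come from inverting the logit of Trout’s reduced model. The following is a minimal sketch of that calculation, assuming the rounded coefficients printed in Table 7.7; the function name and the count-type labels are hypothetical, and values read off Figure 7.5 may differ by a few percentage points from what the rounded coefficients produce.

```python
import math

# Coefficients copied from Table 7.7 (Trout, reduced model, all pitches).
INTERCEPT = 1.578
DISTANCE = -1.772
# Hypothetical labels; "favors_hitter" is the baseline (both indicators are 0).
COUNT_EFFECT = {"favors_hitter": 0.0, "favors_pitcher": 0.613, "neutral": -1.189}


def trout_swing_probability(distance_ft, count_type):
    """Inverse logit of the reduced model at a given distance and count type."""
    logit = INTERCEPT + DISTANCE * distance_ft + COUNT_EFFECT[count_type]
    return 1.0 / (1.0 + math.exp(-logit))


for d in (0.0, 1.0, 1.5, 2.0):
    row = ", ".join(
        f"{c}={trout_swing_probability(d, c):.2f}" for c in COUNT_EFFECT
    )
    print(f"distance {d:.1f} ft: {row}")
```

Because the count indicators only shift the logit, the three curves have the same shape on the log-odds scale and differ by a constant vertical offset.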

Figure 7.5 Probability of Mike Trout Swinging by Distance from Strike Zone and Count Type for All Pitches

Figure 7.5 shows the probability of swinging by distance and count type for Mike Trout. When the pitch is perfect (distance of 0), Trout is about 88% likely to swing at the pitch when the count favors the pitcher, 80% when the count favors Trout, and only about 62% likely to swing when the count is neutral. When the count favors the pitcher, Trout has the highest probability of swinging regardless of distance of the pitch.

Previously all pitches were considered, but what influences Mike Trout to swing or not on the first pitch? Is distance still significant? Are there other variables that are now significant that weren’t for all pitches? As before for all players, count type is no longer included in the full model since only first pitches are considered. For first pitches, the only variable that was significant in Trout’s multiple logistic regression model was distance.

\[
\operatorname{logit}(p) = -1.026 - 1.198\,(\text{Distance})
\]

Table 7.9 shows the coefficient estimates and odds ratio for Trout’s reduced model. For each one-foot increase in distance from the center of the strike zone, the odds of Trout swinging decrease by roughly a factor of 3.3 (1/0.302).

Table 7.9 Coefficient Estimates and Odds Ratio for Mike Trout’s Reduced Model for First Pitches

| Variable | Coefficient Estimate | P-value | Odds Ratio |
|---|---|---|---|
| Intercept | -1.026 | 0.000 | |
| Distance | -1.198 | 0.000 | 0.302 |

Table 7.9 shows the coefficient estimates and odds ratio for Trout’s reduced model. Only distance is significant in determining the probability that Trout swings on first pitches. For each one-foot increase in distance from the center of the strike zone, the odds of Trout swinging decrease by roughly a factor of 3 (1/0.302).

The next player is Victor Martínez, another player who has extreme patience at the plate. Martínez was signed to the Cleveland Indians in 2002 as a and got his debut in that same year. He remained with the Indians until the middle of 2009, when he was traded to the Red Sox. He played both catcher and first base for both teams. In 2011, Martínez went to the . In August 2014, Martínez was named the Player of the Month. He came in second to Trout for the AL MVP. “On November 6, 2014, it was announced that Martínez won his second , as the top-hitting ” (“Víctor”).

In 2015, Martínez hit his 200th career home run. He is also a more mature hitter who only swings at pitches inside the strike zone.

Martínez’s overall swing rate for all pitches is 48%, which is 11 percentage points higher than Mike Trout’s. The same seven variables as for Mike Trout were considered in the full model for Victor Martínez. The variables found to be significant in predicting the probability of swinging in general for Martínez were whether runners were on-base, the distance of the pitch from the center of the strike zone, and the count type. The reduced logistic regression model for Martínez is:

\[
\operatorname{logit}(p) = 2.05 + 0.286\,(\text{Runner On}) - 1.882\,(\text{Distance}) + 0.665\,(\text{Favors Pitcher}) - 1.141\,(\text{Neutral Count})
\]

where p is the probability of swinging and Runner On is “1” if there are any runners on-base at all and “0” otherwise. Distance is the distance in feet from the center of the strike zone. Favors Pitcher is “1” if the count is 0-2, 1-2, 2-2, or 0-1 and “0” otherwise. Neutral Count is “1” if the count is 0-0 or 1-1 and “0” otherwise. If the count favors the hitter, then Favors Pitcher and Neutral Count are both “0.”

Table 7.10 shows the regression coefficients and odds ratios for Victor Martínez’s logistic model. The odds of Martínez swinging are 1.944 times as high when the count favors the pitcher and 1.33 times as high when there are runners on-base. When the count is neutral, the odds of Martínez swinging are roughly 68% (1-0.319) lower, controlling for distance. For each one-foot increase in distance from the center of the strike zone, the odds of Martínez swinging decrease by roughly a factor of 6.6 (1/0.152).

Table 7.10 Coefficient Estimates and Odds Ratio for Victor Martínez’s Reduced Model for All Pitches

| Variable | Coefficient Estimate | P-value | Odds Ratio |
|---|---|---|---|
| Intercept | 2.05 | 0.000 | |
| Runner On | 0.286 | 0.014 | 1.33 |
| Distance | -1.882 | 0.000 | 0.152 |
| Favors Pitcher | 0.665 | 0.000 | 1.944 |
| Neutral Count | -1.141 | 0.000 | 0.319 |

Table 7.10 highlights the coefficient estimates and odds ratios for Victor Martínez’s reduced logistic regression model. The odds of Martínez swinging are 1.944 times as high when the count favors the pitcher and 1.33 times as high when there are runners on-base. When the count is neutral, the odds of Martínez swinging are roughly 68% (1-0.319) lower, controlling for distance. For each one-foot increase in distance from the center of the strike zone, the odds of Martínez swinging decrease by roughly a factor of 6.6 (1/0.152).


Similar to Figure 7.5, Figure 7.6 graphs the probability of swinging on the y-axis against the distance of the pitch from the center of the strike zone for each count type. Victor Martínez has the highest probability of swinging when the count favors the pitcher. Also, as the distance of the pitch from the center of the strike zone increases, the probability of swinging decreases. When the pitch is perfect (distance = 0) and the count is neutral, Martínez’s probability of swinging is about 0.74, which is higher than Trout’s. When the pitch is about 1.5 feet from the center of the strike zone (no longer a strike), Martínez has a probability of swinging of about 13% when the count is neutral, 27% when the count favors Martínez, and about 45% when the count favors the pitcher. These numbers are a little high considering that the pitch is no longer a strike. When the pitch is two feet from the center of the strike zone, Martínez is less than 25% likely to swing at the pitch regardless of the count type.

Figure 7.6 Probability of Victor Martínez Swinging Given that Runners Are Not On-Base for All Pitches

Figure 7.6 shows the probability of swinging by distance and count type for Victor Martínez. When the pitch is perfect (distance of 0), Martínez is about 88% likely to swing at the pitch when the count favors the pitcher, 80% when the count favors Martínez, and only about 69% likely to swing when the count is neutral. When the count favors the pitcher, Martínez has the highest probability of swinging regardless of distance of the pitch.


With the count type held fixed, the probability of swinging can be explored by distance and runner situation. Figure 7.7 shows the probability of swinging for Victor Martínez based on the distance of the pitch from the center of the strike zone and whether there are runners on-base or not. Martínez is slightly more likely to swing when runners are on-base. If there are runners on-base, hitters may want to advance them so that they can try to score. The only way for the hitter to advance runners is to put the ball in play, and putting the ball in play can only happen if the hitter swings. If no runners are on-base, then the hitter can be more selective at the plate.

Figure 7.7 Probability of Victor Martínez Swinging Given that the Count Favors the Hitter for All Pitches

Figure 7.7 displays the probability of swinging for Victor Martínez given that the count favors Martínez. When runners are on-base, Martínez is more likely to swing at the pitch regardless of the distance.
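The runner effect shown in Figure 7.7 can be sketched from the reduced model in Table 7.10. The example below is an illustration under the assumption that the rounded coefficients are used as printed and that the count favors the hitter, so both count indicators are 0; the function names are hypothetical. It shows that the odds ratio for runners on-base is the same at every distance, while the gap in probability changes with distance.

```python
import math

# Coefficients copied from Table 7.10 (Martínez, reduced model, all pitches).
INTERCEPT, RUNNER_ON, DISTANCE = 2.05, 0.286, -1.882


def martinez_swing_probability(distance_ft, runner_on):
    """Swing probability when the count favors the hitter (count indicators = 0)."""
    logit = INTERCEPT + RUNNER_ON * runner_on + DISTANCE * distance_ft
    return 1.0 / (1.0 + math.exp(-logit))


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


for d in (0.0, 0.5, 1.0, 1.5, 2.0):
    p_off = martinez_swing_probability(d, 0)
    p_on = martinez_swing_probability(d, 1)
    print(
        f"{d:.1f} ft: runners off={p_off:.3f}, runners on={p_on:.3f}, "
        f"gap={p_on - p_off:.3f}, odds ratio={odds(p_on) / odds(p_off):.3f}"
    )
```

The probability gap is largest where the swing probability is near 0.5 and shrinks toward either extreme, which is why the two curves sit close together for pitches at the center of the zone and for pitches far outside it.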

What factors might influence Martínez to swing on the first pitch? As before, count type is no longer included in the full model. For first pitches, the only variable that was significant in Martínez’s logistic regression model was distance.

\[
\operatorname{logit}(p) = -0.107 - 1.160\,(\text{Distance})
\]

Table 7.11 shows the coefficient estimates and odds ratio for Martínez’s reduced model. For each one-foot increase in distance from the center of the strike zone, the odds of Martínez swinging decrease by roughly a factor of 3.2 (1/0.313).

Table 7.11 Coefficient Estimates and Odds Ratio for Victor Martínez’s Reduced Model for First Pitches

| Variable | Coefficient Estimate | P-value | Odds Ratio |
|---|---|---|---|
| Intercept | -0.107 | 0.677 | |
| Distance | -1.160 | 0.000 | 0.313 |

Table 7.11 shows that for each one-foot increase in distance from the center of the strike zone, the odds of Martínez swinging decrease by roughly a factor of 3 (1/0.313).

Finally, logistic regression equations are created for Salvador Pérez and Adam Jones. Both of these players are known for being “free swingers.” According to The New Dickson Baseball Dictionary, a free swinger is defined as “a batter who tends to swing at pitches out of the strike zone; one who will swing at almost any ball pitched to him. Free swingers seldom walk” (Dickson). Adam Jones and Salvador Pérez are both known for their first-pitch swing rates.

Salvador Pérez started his major league career with the in 2011 where he played catcher. In 2013, Pérez was awarded the AL Gold Glove as a catcher. The Gold Glove is an “award handed out to players who had a superior fielding season. A total of 18 Gold Gloves are handed out each year with Gloves going to one player at each position for both the National League and American League. The winners of the award are determined by the votes of coaches and managers in the league” (“Gold”). During the Championship, Pérez was named the World Series Most Valuable Player. “Pérez has always been a free swinger, of course. In The Hardball Times just last week, Perez was ranked among the ten worst hitters in the game at making ‘correct’ swing choices” (Petriello).

Salvador Pérez’s average swing rate for any pitch is 0.54; he swings at more than half of the pitches he sees. As in Martínez’s logistic regression model, the factors that influence Pérez’s probability of swinging are whether runners are on-base, the distance of the pitch from the center of the strike zone, and the count type. The reduced logistic regression model for Pérez is:

\[
\operatorname{logit}(p) = 1.95 + 0.36\,(\text{Runner On}) - 1.493\,(\text{Distance}) + 0.668\,(\text{Favors Pitcher}) - 1.04\,(\text{Neutral Count})
\]

where p is the probability of swinging and Runner On is “1” if there are any runners on-base at all and “0” otherwise. Distance is the distance in feet from the center of the strike zone. Favors Pitcher is “1” if the count is 0-2, 1-2, 2-2, or 0-1 and “0” otherwise. Neutral Count is “1” if the count is 0-0 or 1-1 and “0” otherwise. If the count favors the hitter, then Favors Pitcher and Neutral Count are both “0.”

From this equation, the log odds of swinging increase by 0.36 when there are runners on-base compared to when there is no one on-base. As the distance increases by one foot, the log odds decrease by about 1.493. The log odds increase by 0.668 if the count favors the pitcher and decrease by about 1.04 if the count is neutral. Exponentiating each coefficient estimate gives the odds ratio for the corresponding variable. Table 7.12 shows the regression coefficients and odds ratios for Salvador Pérez’s logistic model. The odds of Pérez swinging are 1.951 times as high when the count favors the pitcher and 1.433 times as high when there are runners on-base. When the count is neutral, the odds of Pérez swinging are roughly 65% (1-0.353) lower, controlling for distance. For each one-foot increase in distance from the center of the strike zone, the odds of Pérez swinging decrease by roughly a factor of 4.4 (1/0.225).


Table 7.12 Coefficient Estimates and Odds Ratio for Salvador Pérez’s Reduced Model for All Pitches

| Variable | Coefficient Estimate | P-value | Odds Ratio |
|---|---|---|---|
| Intercept | 1.95 | 0.000 | |
| Runner On | 0.36 | 0.002 | 1.433 |
| Distance | -1.493 | 0.000 | 0.225 |
| Favors Pitcher | 0.668 | 0.000 | 1.951 |
| Neutral Count | -1.04 | 0.000 | 0.353 |

Table 7.12 shows the coefficient estimates and odds ratios from the reduced logistic regression model for Salvador Pérez. The odds of Pérez swinging are 1.951 times as high when the count favors the pitcher and 1.433 times as high when there are runners on-base. When the count is neutral, the odds of Pérez swinging are roughly 65% (1-0.353) lower, controlling for distance. For each one-foot increase in distance from the center of the strike zone, the odds of Pérez swinging decrease by roughly a factor of 4.4 (1/0.225).
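Because the reduced model contains no interaction terms, effects add on the log-odds scale and multiply on the odds scale. The short sketch below illustrates this with the Pérez coefficients from Table 7.12, for the case where runners are on-base and the count favors the pitcher at the same time; the dictionary and variable names are only for illustration.

```python
import math

# Coefficients copied from Table 7.12 (Pérez, reduced model, all pitches).
perez = {
    "Runner On": 0.36,
    "Distance": -1.493,
    "Favors Pitcher": 0.668,
    "Neutral Count": -1.04,
}

# Runners on-base AND a count that favors the pitcher, versus neither,
# at the same distance from the center of the strike zone.
log_odds_shift = perez["Runner On"] + perez["Favors Pitcher"]
combined_odds_ratio = math.exp(log_odds_shift)
product_of_ratios = math.exp(perez["Runner On"]) * math.exp(perez["Favors Pitcher"])

print(f"log-odds shift: {log_odds_shift:.3f}")
print(f"combined odds ratio: {combined_odds_ratio:.3f}")
print(f"product of the two odds ratios: {product_of_ratios:.3f}")
```

The combined odds ratio equals the product of the two individual odds ratios in the table (about 1.433 times 1.951), which is a consequence of the additive model form rather than a separate empirical finding.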

Figure 7.8 highlights the probability that Salvador Pérez swings given that there are no runners on-base. If the count favors the pitcher and the pitch is perfect (distance = 0), Pérez’s probability of swinging is at its highest, about 0.95. Pérez has the highest probability of swinging when the count favors the pitcher regardless of the distance of the pitch from the strike zone. The pitch would still be considered a strike up to one foot from the center of the strike zone. Anything greater than one foot from the center of the strike zone would not be a strike. When the pitch is about 1.5 feet from the center of the strike zone, Pérez has a probability of swinging of about 25% when the count is neutral, 40% when the count favors Pérez, and about 62% when the count favors the pitcher. These numbers are fairly high considering that the pitch is no longer a strike. When the pitch is two feet from the center of the strike zone (no longer a strike), Pérez is still about 40% likely to swing at the pitch when the count favors the pitcher.


Figure 7.8 Probability of Salvador Pérez Swinging Given that Runners Are Not On-Base for All Pitches

Figure 7.8 shows the probability of swinging by distance and count type for Salvador Pérez. When the pitch is perfect (distance of 0), Pérez is about 94% likely to swing at the pitch when the count favors the pitcher, 88% when the count favors Pérez, and about 70% likely to swing when the count is neutral. When the count favors the pitcher, Pérez has the highest probability of swinging regardless of distance of the pitch.

How does the probability of Pérez swinging change based on the distance and runner situation given that the count favors the hitter? Figure 7.9 shows the probability of swinging for Salvador Pérez based on the distance of the pitch from the center of the strike zone and whether there are runners on-base or not. Pérez is slightly more likely to swing when runners are on-base. This pattern also held for Martínez; for Trout, the runner situation was not significant.


Figure 7.9 Probability of Salvador Pérez Swinging Given that the Count Favors the Hitter for All Pitches

Figure 7.9 displays the probability of swinging for Salvador Pérez given that the count favors Pérez. When runners are on-base, Pérez is more likely to swing at the pitch regardless of the distance.

Next, only first pitches are considered for Salvador Pérez. As with Trout and Martínez, distance is the only variable that is significant in predicting the probability that Pérez will swing on the first pitch. Pérez’s reduced logistic model is:

\[
\operatorname{logit}(p) = 0.260 - 1.063\,(\text{Distance})
\]

Table 7.13 shows that for each one-foot increase in distance from the center of the strike zone, the odds of Pérez swinging decrease by roughly a factor of 2.9 (1/0.346).

Table 7.13 Coefficient Estimates and Odds Ratio for Salvador Pérez’s Reduced Model for First Pitches

| Variable | Coefficient Estimate | P-value | Odds Ratio |
|---|---|---|---|
| Intercept | 0.260 | 0.211 | |
| Distance | -1.063 | 0.000 | 0.346 |

Table 7.13 illustrates that for each one-foot increase in distance from the center of the strike zone, the odds of Pérez swinging decrease by roughly a factor of 2.9 (1/0.346).


The last player considered is Adam Jones. Jones was drafted in 2003 by the Seattle Mariners with the 37th pick in the first round as a shortstop. He made his debut in 2006 and remained with the Mariners until 2007. The Orioles then traded for him in 2008. In 2012, Jones was named the Most Valuable Oriole for the second season in a row, and he remains with the Orioles to this day (“Adam Jones (baseball)”). One critique of Jones is that he “is too much of a free swinger and would benefit significantly by taking more free passes--and cutting down on his strikeouts” (“Adam Jones Stats”).

Adam Jones has the highest swing rate of all of the players investigated thus far: he swings at sixty percent of pitches. The same seven variables were considered in the full model for Adam Jones. The variables found to be significant in predicting the probability of swinging for Jones were whether runners were on-base, the distance of the pitch from the center of the strike zone, and the count type. These are the same variables found to be significant in the models of Martínez and Pérez. The reduced logistic regression model for Jones is:

\[
\operatorname{logit}(p) = 2.749 + 0.435\,(\text{Runner On}) - 1.922\,(\text{Distance}) + 0.468\,(\text{Favors Pitcher}) - 0.818\,(\text{Neutral Count})
\]

From this equation, the log odds of swinging increase by 0.435 when there are runners on-base compared to when there is no one on-base. As the distance increases by one foot, the log odds decrease by about 1.922. The log odds increase by 0.468 if the count favors the pitcher and decrease by about 0.818 if the count is neutral. Exponentiating each coefficient estimate gives the odds ratio for the corresponding variable. Table 7.14 shows the regression coefficients and odds ratios for Adam Jones’s logistic model. The odds of Jones swinging are 1.597 times as high when the count favors the pitcher and 1.546 times as high when there are runners on-base. When the count is neutral, the odds of Jones swinging are roughly 56% (1-0.441) lower, controlling for distance. For each one-foot increase in distance from the center of the strike zone, the odds of Jones swinging decrease by roughly a factor of 6.8 (1/0.146).

Table 7.14 Coefficient Estimates and Odds Ratio for Adam Jones’s Reduced Model for All Pitches

| Variable | Coefficient Estimate | P-value | Odds Ratio |
|---|---|---|---|
| Intercept | 2.749 | 0.000 | |
| Runner On | 0.435 | 0.0001 | 1.546 |
| Distance | -1.922 | 0.000 | 0.146 |
| Favors Pitcher | 0.468 | 0.002 | 1.597 |
| Neutral Count | -0.818 | 0.000 | 0.441 |

Table 7.14 shows the coefficient estimates and odds ratios for the reduced logistic model for Adam Jones. The odds of Jones swinging are 1.597 times as high when the count favors the pitcher and 1.546 times as high when there are runners on-base. When the count is neutral, the odds of Jones swinging are roughly 56% (1-0.441) lower, controlling for distance. For each one-foot increase in distance from the center of the strike zone, the odds of Jones swinging decrease by roughly a factor of 6.8 (1/0.146).

From Figure 7.10, Jones’s probability of swinging at the perfect pitch is close to 0.95 when the count favors the pitcher. When the pitch is 1.5 feet away from the center of the strike zone, and so no longer a strike, Jones’s probability of swinging is greater than 50% for counts that favor the pitcher, about 45% when the count favors the hitter, and 25% when the count is neutral. This again confirms that Jones is a free swinger who is likely to swing at most pitches regardless of their location. For all four players, the probability of swinging is highest when the count favors the pitcher regardless of the distance of the pitch. The second highest probability for all four players occurs when the count favors the hitter, and the lowest probability occurs when the count is neutral.


Figure 7.10 Probability of Adam Jones Swinging Given that Runners Are Not On-Base for All Pitches

Figure 7.10 shows the probability of swinging by distance and count type for Adam Jones. When the pitch is perfect (distance of 0), Jones is about 95% likely to swing at the pitch when the count favors the pitcher, 93% when the count favors Jones, and about 88% likely to swing when the count is neutral. When the count favors the pitcher, Jones has the highest probability of swinging regardless of distance of the pitch.

Controlling for count type, the probability of swinging can be explored by distance and runner situation. Figure 7.11 shows the probability of swinging for Adam Jones based on the distance of the pitch from the center of the strike zone and whether there are runners on-base or not. Jones is slightly more likely to swing when runners are on-base. The same holds for Martínez and Pérez; for Trout, the runner situation was not significant. If the pitch is perfect (distance = 0), then Jones has a probability of swinging close to 1. This is the highest probability seen for the runner situation among the players examined.


Figure 7.11 Probability of Adam Jones Swinging Given that the Count Favors the Hitter for All Pitches

Figure 7.11 displays the probability of swinging for Adam Jones given that the count favors Jones. When runners are on-base, Jones is more likely to swing at the pitch regardless of the distance. When the count favors Jones and the pitch is perfect (distance=0), Jones has a probability of swinging close to 95% when runners are on-base.

Since Adam Jones has the highest swing rate for all pitches, will other variables besides distance be significant in predicting the probability that Jones swings on the first pitch? Unlike for the other three hitters, both distance and the runner situation influence whether Jones swings at the first pitch. Jones’s reduced logistic model is:

\[
\operatorname{logit}(p) = 1.112 - 1.554\,(\text{Distance}) + 0.773\,(\text{Runner On})
\]

Table 7.15 shows that for each one-foot increase in distance from the center of the strike zone, the odds of Jones swinging decrease by roughly a factor of 4.7 (1/0.211). Jones’s odds of swinging at the first pitch are about 2.3 times as high when runners are on-base as when no runners are on-base.


Table 7.15 Coefficient Estimates and Odds Ratio for Adam Jones’s Reduced Logistic Regression Model for First Pitches

| Variable | Coefficient Estimate | P-value | Odds Ratio |
|---|---|---|---|
| Intercept | 1.112 | 0.000 | |
| Distance | -1.554 | 0.000 | 0.211 |
| Runner On | 0.773 | 0.000 | 2.266 |

Table 7.15 shows that for each one-foot increase in distance from the center of the strike zone, the odds of Jones swinging decrease by roughly a factor of 4.7 (1/0.211). Jones’s odds of swinging at the first pitch are about 2.3 times as high when runners are on-base as when runners are not on-base.

Table 7.16 shows the odds ratios for all four players. The runner situation was not significant for Mike Trout but was significant for the other three players. Distance and count type were significant for all four players. When runners were on, Jones’s odds of swinging were 1.546 times as high as when runners were not on-base, the largest runner effect among Martínez, Pérez, and Jones. Pérez has the highest odds ratio for swinging when the count favors the pitcher and Trout has the lowest odds ratio when the count is neutral. Looking at the distance from the center of the strike zone, Pérez’s odds of swinging decrease by roughly a factor of 4.4 for each one-foot increase in distance, compared to 5.9 for Trout, 6.6 for Martínez, and 6.8 for Jones.

Table 7.16 Odds Ratio Comparison for All Four Players for All Pitches

| Odds Ratios | Mike Trout | Victor Martínez | Salvador Pérez | Adam Jones |
|---|---|---|---|---|
| Runner On | Not significant | 1.33 | 1.433 | 1.546 |
| Distance | 0.1699 | 0.152 | 0.225 | 0.146 |
| Favors Pitcher | 1.846 | 1.944 | 1.951 | 1.597 |
| Neutral Count | 0.304 | 0.319 | 0.353 | 0.441 |

Table 7.16 compares the odds ratios for all pitches for the four players. “Runner On” was not significant for Mike Trout. When runners were on, Jones’s odds of swinging were 1.546 times as high as when runners were not on-base, the largest runner effect among Martínez, Pérez, and Jones. Pérez has the highest odds ratio for swinging when the count favors the pitcher and Trout has the lowest odds ratio when the count is neutral.
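The comparison in Table 7.16 can be reproduced from the reduced-model coefficients reported for each player. The sketch below is illustrative only: it uses the rounded coefficients from Tables 7.7, 7.10, 7.12, and 7.14, and the helper name `extreme` is hypothetical.

```python
import math

# Reduced-model coefficients for all pitches (Tables 7.7, 7.10, 7.12, 7.14).
# None marks a variable that was not significant and was dropped.
coefs = {
    "Trout": {"Runner On": None, "Distance": -1.772,
              "Favors Pitcher": 0.613, "Neutral Count": -1.189},
    "Martínez": {"Runner On": 0.286, "Distance": -1.882,
                 "Favors Pitcher": 0.665, "Neutral Count": -1.141},
    "Pérez": {"Runner On": 0.36, "Distance": -1.493,
              "Favors Pitcher": 0.668, "Neutral Count": -1.04},
    "Jones": {"Runner On": 0.435, "Distance": -1.922,
              "Favors Pitcher": 0.468, "Neutral Count": -0.818},
}

# Odds ratio = exp(coefficient); keep None where the variable was dropped.
odds_ratios = {
    player: {var: (None if b is None else math.exp(b)) for var, b in row.items()}
    for player, row in coefs.items()
}


def extreme(variable, pick):
    """Player with the largest (pick=max) or smallest (pick=min) odds ratio."""
    kept = {p: r[variable] for p, r in odds_ratios.items() if r[variable] is not None}
    return pick(kept, key=kept.get)


print("largest Runner On odds ratio:", extreme("Runner On", max))
print("largest Favors Pitcher odds ratio:", extreme("Favors Pitcher", max))
print("smallest Neutral Count odds ratio:", extreme("Neutral Count", min))
print("weakest distance effect:", extreme("Distance", max))
```

A distance odds ratio closer to 1 means a weaker drop-off in swinging as the pitch moves away from the center of the zone, which is why the largest distance odds ratio identifies the hitter least deterred by location.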


Figure 7.12 compares the probability of swinging for all four players by count type when runners are not on-base. Overall, when the count favors the pitcher, the hitters are more likely to swing compared to when the count favors the hitter or is neutral. When the count favors the pitcher, the hitters are also more likely to swing at pitches farther away from the strike zone. At a distance of two feet from the center of the strike zone, the mean probability of swinging across the four players is roughly 25% when the count favors the pitcher. In contrast, when the count is neutral, there is only about a 10% chance that the hitters swing. As expected, Adam Jones has the highest probability of swinging for all pitches when the distance is zero, regardless of count type.

Figure 7.12 Probability of Swinging on All Pitches for Runners Not On-Base by Count Type

Figure 7.12 shows that when the count favors the pitcher, the hitters are more likely to swing compared to when the count favors the hitter or is neutral. Most of the time, Adam Jones has the highest probability of swinging for all pitches and runners not on-base regardless of the count type and Mike Trout has the lowest probability of swinging.

Table 7.17 shows the odds ratios for all four players for first pitches only. The runner situation was only significant for Adam Jones. Given that there are no runners on-base (“Runner On” = 0), for every one-foot increase in distance, the odds of swinging decrease by about 70% for Trout, 69% for Martínez, 65% for Pérez, and 79% for Jones. Figure 7.13 shows the probability of swinging on the first pitch for each of these four players given that runners are not on-base. As expected, Adam Jones has the highest probability of swinging (about 0.75) on the first pitch when the distance is zero and Mike Trout has the lowest (about 0.25). When the pitch is two feet from the center of the strike zone (no longer a strike), Pérez appears to have a slightly higher probability of swinging on first pitches than the other three players.

Table 7.17 Odds Ratio Comparison for All Four Players for First Pitches

| Odds Ratios | Mike Trout | Victor Martínez | Salvador Pérez | Adam Jones |
|---|---|---|---|---|
| Runner On | Not significant | Not significant | Not significant | 2.266 |
| Distance | 0.302 | 0.313 | 0.346 | 0.211 |

Table 7.17 shows that given that there are no runners on-base (“Runner On” = 0), for every one-foot increase in distance, the odds of swinging decrease by about 70% for Trout, 69% for Martínez, 65% for Pérez, and 79% for Jones.

Figure 7.13 Probability of Swinging on First Pitches for Runners Not On-Base

Figure 7.13 shows the probability that certain players swing on the first pitch given that runners are not on-base. Adam Jones has the highest probability of swinging until the distance is about 1.5 feet from the center of the strike zone. After 1.5 feet, Salvador Pérez has the highest probability of swinging on the first pitch.
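The curves in Figure 7.13 can be approximated from the first-pitch reduced models. The following sketch assumes the rounded intercepts and distance coefficients in Tables 7.9, 7.11, 7.13, and 7.15 with no runners on-base; the function names are hypothetical, and the exact distance at which the curves cross depends on the unrounded estimates, so only the ordering at the two ends is checked here.

```python
import math

# First-pitch reduced models with runners not on-base
# (Tables 7.9, 7.11, 7.13, 7.15): (intercept, distance coefficient).
first_pitch = {
    "Trout": (-1.026, -1.198),
    "Martínez": (-0.107, -1.160),
    "Pérez": (0.260, -1.063),
    "Jones": (1.112, -1.554),
}


def swing_probability(player, distance_ft):
    """Inverse logit of the player's first-pitch model at the given distance."""
    b0, b1 = first_pitch[player]
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * distance_ft)))


def ranking(distance_ft):
    """Players ordered from most to least likely to swing at this distance."""
    return sorted(
        first_pitch,
        key=lambda p: swing_probability(p, distance_ft),
        reverse=True,
    )


for d in (0.0, 1.0, 2.0):
    probs = ", ".join(f"{p}={swing_probability(p, d):.2f}" for p in ranking(d))
    print(f"distance {d:.1f} ft: {probs}")
```

Jones starts highest because of his large intercept, but his steeper distance coefficient means his curve falls faster than Pérez’s, so the ordering of the two changes for pitches well outside the zone.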


7.2 Generalized Additive Modeling

The next model used is the generalized additive model (GAM). In general, the goal of an observational study is to determine the relationship between the predictor variables and a response variable. A multiple linear regression model has the form:

\[
E(Y) = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n
\]

where $Y$ is the response random variable, $X_1, \dots, X_n$ are the predictor variables, and $\beta_0, \beta_1, \dots, \beta_n$ are the unknown regression coefficients. In the linear regression model, the relationship between the independent variables and the dependent variable is assumed to be linear. A linear relationship means that a one-unit change in one of the predictors, when the remaining predictors are held constant, results in a constant change in the expected response.

A generalized additive model (GAM) is a generalization of the linear model and has the form:

\[
g(E(Y)) = s_1(x_1) + \dots + s_n(x_n)
\]

where $Y$ is the dependent variable, $g$ is a link function, and $s_1, \dots, s_n$ are smooth, nonparametric functions of the predictors.

“The purpose of generalized additive models is to maximize the quality of prediction of a dependent variable Y from various distributions…In other words, instead of a single coefficient for each variable (additive term) in the model, in additive models an unspecified (non-parametric) function is estimated for each predictor, to achieve the best prediction of the dependent variable values” (“Generalized”).

The fitting method is not least-squares, but a more sophisticated nonparametric method.

Kim Larsen, in the article “GAM: The Predictive Modeling Silver Bullet,” describes GAM as “an additive modeling technique where the impact of the predictive variables is captured through smooth functions which - depending on the underlying patterns in the data - can be nonlinear” (Larsen). Larsen presents a great visual representation of GAM, which is shown in Figure 7.14.

Figure 7.14 Visual Representation of GAM

Figure 7.14 visually shows that generalized additive models (GAM) have non-parametric functions for each predictor variable that are added together for all of the predictor variables. The non-parametric functions do not have to be linear.

Linear regression models obtain a set of regression parameters $\beta_0, \beta_1, \dots, \beta_n$, whereas GAM is an “additive combination of arbitrary univariate functions of the independent variables” (Beck). GAM is a more sophisticated model since it allows for general functions of the covariates in the model. GAM is also flexible because it allows for “the effect of each independent variable to be modelled non-parametrically while requiring that the effect of all the independent variables is additive” (Beck).

In the following sections, GAM is used to look at the location of the pitch in more detail.

Logistic modeling assumes that the logit of a probability is a linear function of covariates such as distance and inning. It is more difficult to include the (x, y) location as a covariate, since the logit of the probability is not a linear function of the (x, y) location. Instead of using only the distance from the center of the strike zone, GAM allows the actual pitch location (x, y) to be used. The generalized additive model for the probability of the hitter swinging given the location becomes:

$$\mathrm{logit}(p) = \beta_0 + s(x, y)$$

where $p$ is the probability of swinging, $x$ is the x-coordinate of the pitch, $y$ is the y-coordinate of the pitch, and $s(\cdot)$ is a smooth function of the location of the pitch. Being able to use the actual pitch location allows us to plot these points against the strike zone.
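To read a fitted model of this form, the value β0 + s(x, y) on the logit scale is mapped back to a probability with the inverse logit, p = 1 / (1 + exp(−η)). The sketch below illustrates only that conversion, in Python rather than the R used for the analysis. The intercept, the assumed zone center, and the bump-shaped stand-in for s(x, y) are made-up assumptions for illustration; they are not the smooth surface estimated in this thesis.

```python
import math


def inverse_logit(eta):
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def toy_smooth(x, y, center=(0.0, 2.5), scale=1.0, height=2.0):
    """Hypothetical stand-in for s(x, y): largest at `center`, decaying with distance.

    A fitted GAM would estimate this surface from the data instead.
    """
    squared_distance = (x - center[0]) ** 2 + (y - center[1]) ** 2
    return height * math.exp(-squared_distance / (2.0 * scale ** 2))


def swing_probability(x, y, beta0=-2.0):
    """Return p such that logit(p) = beta0 + s(x, y), with an assumed beta0."""
    return inverse_logit(beta0 + toy_smooth(x, y))


# Probability at the assumed zone center versus two feet to the side of it.
print(round(swing_probability(0.0, 2.5), 3))
print(round(swing_probability(2.0, 2.5), 3))
```

Because the smooth enters on the logit scale, the resulting probability always stays between 0 and 1 regardless of the shape of s(x, y).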

7.2.1 Probability of Swinging Based on the Location of the Pitch

One factor relevant to swinging or not swinging is the location of the pitch. One would expect hitters to be more likely to swing if the pitch is inside the strike zone and less likely to swing at pitches outside of the zone. Figure 7.15 shows a contour plot of the probability of swinging as a function of the location of the pitch. The red box indicates the strike zone. According to the plot, hitters swing 40% of the time when the pitch is inside the strike zone. Hitters are less likely to swing (about 10% of the time) when the pitch is well outside the strike zone, shown by the outer blue ring.

Figure 7.15 Probability of Swinging by Pitch Location

Figure 7.15 displays a contour plot of the probability of swinging for the location of the pitch. The red box indicates the strike zone. Hitters are more likely to swing at pitches inside the strike zone, compared to pitches outside the strike zone.

7.2.2 Probability of Swinging Based on the Hitter

Not only does the location of the pitch have an effect on the swing rate, but there is also variability between hitters in this effect. Certain players are more likely to swing at pitches than other players. Usually the more mature hitters do not swing at most pitches and demonstrate patience at the plate by getting deeper into the count. Other hitters like to swing often to try to make contact with the pitch. Mike Trout and Victor Martínez are two current top hitters who exhibit extreme patience at the plate. Figure 7.16 shows the probability of swinging for all pitches for Mike Trout and Victor Martínez given the pitch location. Mike Trout very rarely swings at pitches that are outside the strike zone. Victor Martínez is also a more mature hitter like Trout. Occasionally, Martínez is slightly more likely than Trout to swing at pitches outside of the strike zone.

Figure 7.16 Probability of Swinging for All Pitches by Pitch Location for Mike Trout and Victor Martínez

Figure 7.16 shows the probability of swinging for all pitches by pitch location for Mike Trout and Victor Martínez. The red box indicates the strike zone. Both players have a small probability of swinging at pitches outside of the strike zone.

Recall from Section 7.1.6 that Jones and Pérez are “free swingers” and are more likely to swing at pitches than most other hitters. Figure 7.17 shows that both players have a high probability of swinging at pitches outside of the strike zone. About 30% of the time they swing at pitches outside of the strike zone, compared to about 10% for Trout or Martínez.

Figure 7.17 Probability of Swinging for All Pitches by Pitch Location for Adam Jones and Salvador Pérez

Figure 7.17 shows the probability of swinging by pitch location for Adam Jones and Salvador Pérez. The red box indicates the strike zone. About 30% of the time Jones and Pérez swing at pitches outside of the strike zone.

SECTION 8: CONCLUSION

The analysis of baseball has come a long way since Bill James. The amount of data available for baseball is astronomical, as is the speed at which the data can be obtained. Want the play-by-play data for a certain game? All one has to do is go to Retrosheet, and within a matter of minutes the user will have the data and more. Want to know more about specific pitches thrown in a game or season? PITCHf/x data can be downloaded into R with a few commands. With all of the resources available, the questions that can be answered about baseball are endless.

The overall goal of the thesis was to explore the importance of the first pitch. What is the significance of the first pitch? How does the first pitch influence swing rates? Is it beneficial to swing on the first pitch? To answer these questions, exploratory methods, logistic modeling, and generalized additive modeling were used. Each method provided different insight into the first pitch.

Section 1 gave an introduction to baseball and the available statistical measures of player performance. Section 2 provided an overview of the importance of pitch count and the first pitch, which led to a presentation of research questions and methodological design in Section 3. Section 4 provided exploratory work on the first pitch, including an example of the importance of the first pitch, characteristics of the first pitch, pitch location by stance and count, and hit quality.

Section 5 looked at the mean hitting statistics by the result of the pitch and Section 6 presented mean pitching statistics by the result of the pitch for all pitches, first pitches, and non-first pitches.

Lastly, Section 7 used logistic regression modeling to explore the probability that a hitter would swing for all pitches and first pitches, for different situations, such as location of pitch, hitter, count, runner situation, batting order, and inning. General patterns of the probability of a hitter swinging were presented in Section 7 using generalized additive modeling (GAM).

Throughout these sections, an intensive study of the first pitch was presented. The first pitch is important to both the hitter and the pitcher. The first pitch determines who has the advantage in the at-bat thus far. The advantage can change from pitcher to hitter and vice versa with each pitch. The last thing hitters want is to dig themselves a hole, like being down 0-2 or 1-2, and have to battle their way back in the count. Although hitters do not want to get down in the count early on, they also do not want to have a short at-bat. Hitters are looking to stretch out the at-bat and see as many pitches as possible. They want to make the pitcher work and raise the pitcher's pitch count so that the pitcher does not last as long in the game. For all pitches, a called ball was the best outcome for hitters’ statistics and a called strike was the worst. For first pitches, a called ball was the best outcome and putting the ball in play was the worst. Generally, it is better for hitters to have a swinging strike than a called strike.

Pitchers, on the other hand, want to put hitters in a hole so that they are more likely to swing at pitches that they normally would not. Throwing first-pitch strikes leads to better pitching statistics, such as lower WHIP, more innings pitched, more strikeouts, and fewer walks. Pitchers want to get the hitters to swing but not make contact. A swinging strike was the best outcome for pitchers’ statistics, as seen in Section 6. Pitchers are also trying to throw as few pitches as possible so that they can stay in the game longer.

Investigating the pitch type for the first pitch also helps hitters understand what to expect when they get into the batter’s box. Most of the time, the first pitch is not an off-speed pitch. Four-seam fastballs are the most common first pitch. The second most common first pitch is the slider when the stance of the hitter matches the hand of the pitcher (L-L or R-R), and the two-seam fastball otherwise. Knowing the type of pitch typically thrown on the first pitch can be very beneficial to hitters. They can anticipate the pitch and be ready to hit a fast pitch. If the pitch is not fast, they can lay off of it and wait for another pitch.

Distance from the center of the strike zone, count type and the runner situation appeared to be significant in most of the logistic models for the probability of swinging for all pitches. Each hitter has different hitting strategies they were taught and ultimately whether to swing on the first pitch is something that is specific to the hitter. Distance from the center of the strike zone was the only factor that was significant in predicting the probability the hitter swung on the first pitch, except for the case of Adam Jones.

Free-swingers, like Adam Jones and Salvador Pérez, are more likely to swing at pitches that are farther away from the strike zone, compared to more mature hitters like Mike Trout and Victor Martínez. With runners on base, the probability of swinging is slightly higher for all hitters. Hitters are trying to move runners, so they are more likely to swing to try to put the ball in play. When the count favors the pitcher, hitters are more likely to swing at pitches, and when the count is neutral, hitters are a little less likely to swing. Adam Jones and Salvador Pérez, two free-swingers, had a probability of swinging on the first pitch greater than 50%. The more mature hitters are less likely than free-swingers to swing, both at the first pitch and in general.

I hope that this work not only helps the public understand the importance of the first pitch, but also current and future baseball (and possibly ) players. From a pitcher’s point of view, getting the hitter to swing on the first pitch can be the best outcome, but it can be a challenge to actually accomplish. Hitters are very particular when it comes to swinging on the first pitch. The pitch has to be in the right zone and of the right type for the hitter to swing at it. Knowing that hitters are less likely to swing on the first pitch, pitchers should take advantage and try to get a first-pitch strike.

There are many questions in baseball that can be answered with statistics. Additional studies can be conducted to answer questions such as: What happens when the hitter does swing and make contact with the ball? Are hitters more likely to get an out or a hit? What factors might be significant in predicting whether the ball in play will be a hit or an out? This could be examined for both the first pitch and all pitches, or broken out by count type (favors hitter, favors pitcher, or neutral count). What about the missed swings? Do they happen more often in certain situations?

Overall, understanding the importance of the first pitch is crucial for the game of baseball. Using the information about the first pitch can help hitters and pitchers be more successful. Coaches could also gain new insight into the game of baseball. This research could help lead the way to using statistics in many areas of baseball to answer in-depth questions.

REFERENCES

"About Sean." SeanLahman.com. N.p., n.d. Web. 30 Oct. 2016.

"Adam Jones (baseball)." Wikipedia. Wikimedia Foundation, 19 Feb. 2017. Web. 20 Feb. 2017.

"Adam Jones Stats, Profile, Bio, Analysis and More." The Sports Forecaster. N.p., n.d. Web. 06 Nov. 2016.

Albert, Jim. "The First Pitch Effect." Exploring Baseball Data with R. N.p., 06 Apr. 2014. Web. 09 Aug. 2016.

---. "An Introduction to Sabermetrics." N.p., 16 Aug. 2013. Web. 30 Oct. 2016.

---. "Using the Count to Measure Pitching Performance." Journal of Quantitative Analysis in Sports 6.4 (2010): 4-8. Print.

Augustyn, Adam. "Mike Trout." Encyclopedia Britannica Online. Encyclopedia Britannica, 9 Feb. 2016. Web. 07 Nov. 2016.

Beck, Nathaniel, and Simon Jackman. "Getting the Mean Right: Generalized Additive Models." Society for Political Methodology (1996): 1-12. Spring 2012. Web. 5 Mar. 2017.

Birnbaum, Phil. "A Guide to Sabermetric Research." Society for American Baseball Research. N.p., n.d. Web. 30 Oct. 2016.

Bruno, Ben. "Carlos Gomez's Struggles This Season Were on Full Display Monday Night." NumberFire. N.p., 09 Aug. 2016. Web. 08 Nov. 2016.

Cameron, Dave. "Has the First-pitch Take Become a Losing Strategy?" FOX Sports. N.p., 11 June 2014. Web. 05 Nov. 2016.

"Complete List (Offense)." FanGraphs Sabermetrics Library. FanGraphs, 23 Oct. 2016. Web. 24 Sept. 2016.

"Complete List (Pitching)." FanGraphs Sabermetrics Library. FanGraphs, 23 Oct. 2016. Web. 24 Sept. 2016.

Dhakar, Lokesh. "Baseball Pitches Illustrated." N.p., n.d. Web. 10 Oct. 2016.

Dickson, Paul. The New Dickson Baseball Dictionary. New York: Harcourt Brace, 1999. Print.

Edlin, Don. "Hitting - Game Time Approach." QCBaseball. N.p., n.d. Web. 10 Oct. 2016.

---. "Pitching - Strategy." QCBaseball. N.p., n.d. Web. 10 Oct. 2016.

Ellis, Steven. "39 Youth Pitching Strategies To Keep Hitters Guessing." YouthPitching.com. N.p., 25 June 2016. Web. 11 Oct. 2016.

Fast, Mike. "What the Heck Is PITCHf/x?" The Hardball Times Baseball Annual 2010. Skokie, IL: ACTA Sports, 2009. N. pag. Print.

---. "Glossary of the Gameday Pitch Fields." Fast Balls. N.p., 2 Aug. 2007. Web. 31 Oct. 2016.

"Generalized Additive Models - Introductory Overview." Statistica. N.p., 2016. Web. 05 Mar. 2017.

Gentile, James. "Swinging or Taking on the First Pitch." The Hardball Times. The Hardball Times, 23 Aug. 2013. Web. 18 May 2016.

"Glossary." MLB.com. N.p., n.d. Web. 14 Sept. 2016.

"Gold Glove." SportingCharts. N.p., n.d. Web. 11 Oct. 2016.

Goldsberry, Kirk. "Visualizing Patterns in MLB Pitching." Grantland. N.p., 23 Oct. 2013. Web. 05 Nov. 2016.

Kalkman, Sky. "Optimizing Your Lineup By The Book." Beyond the Box Score. N.p., 09 Oct. 2012. Web. 23 Oct. 2016.

Kurtenbach, Dieter. "The Problem with the 100-pitch Limit." FOX Sports. N.p., 09 Apr. 2016. Web. 28 Sept. 2016.

Larsen, Kim. "GAM: The Predictive Modeling Silver Bullet." Multithreaded. Stitch Fix, 30 July 2015. Web. 05 Mar. 2017.

Lipszyc, Ruben. "Winning and Losing Pitcher." Baseball Scoring Rules. N.p., 8 Sept. 2012. Web. 15 Jan. 2017.

Marchi, Max, and Jim Albert. Analyzing Baseball Data with R. Boca Raton: CRC, 2014. Print.

Miller, Doug. "With First-pitch Strikes, Pitchers Gain a Substantial Advantage." Major League Baseball. MLB.com, 09 May 2013. Web. 17 May 2016.

Mills, Brian. "Pitch Classification with Mclust." Exploring Baseball Data with R. N.p., 11 Jan. 2015. Web. 20 Oct. 2016.

Petriello, Mike. "There’s Nothing Salvador Perez Won’t Swing At." FanGraphs. N.p., 22 Oct. 2014. Web. 11 Nov. 2016.

"Pitch Type Abbreviations & Classifications." FanGraphs. N.p., 04 Nov. 2016. Web. 05 Nov. 2016.

Sarris, Eno. "Extreme First Pitch Swingers (and Takers)." FanGraphs. N.p., 12 May 2014. Web. 01 Nov. 2016.

"Site Overview." Retrosheet. N.p., 14 Dec. 2014. Web. 30 Oct. 2016.

Sullivan, Jeff. "Return of the First-Pitch Swing." FanGraphs. N.p., 30 Sept. 2015. Web. 18 May 2016.

Schwanke, Jim. "Hitter Discipline Can Beat Elite Pitchers." Collegiate Baseball. N.p., 6 Mar. 2009. Web. 10 Oct. 2016.

Schwarz, Alan. The Numbers Game: Baseball's Lifelong Fascination with Statistics. New York: T. Dunne, 2004. Print.

Szumilas, Magdalena. "Explaining Odds Ratios." Journal of the Canadian Academy of Child and Adolescent Psychiatry 19.3 (2010): 227–229. Print.

Tango, Tom. "THE BOOK--Playing The Percentages In Baseball." THE BOOK--Playing The Percentages In Baseball. N.p., 31 Aug. 2012. Web. 17 May 2016.

"Technology-PITCHf/x." Sportvision. N.p., 2014. Web. 30 Oct. 2016.

Tognetti, Phil. "Strike One." The Full Windup. N.p., 20 July 2013. Web. 11 Oct. 2016.

Treder, Steve. "The Strikeout Ascendant (and What Should Be Done About It)." The Hardball Times. N.p., 30 Jan. 2015. Web. 05 Nov. 2016.

"Víctor Martínez (baseball)." Wikipedia. Wikimedia Foundation, 17 Feb. 2017. Web. 20 Feb. 2017.

Walsh, John. "In Search of the Sinker." The Hardball Times. N.p., 6 June 2007. Web. 31 Oct. 2016.

---. "Pitch Identification Tutorial." The Hardball Times. N.p., 19 Sept. 2007. Web. 05 Nov. 2016.

APPENDIX A: PITCH TYPE DEFINITIONS

A four-seam fastball has little movement and is thrown relatively straight. The speed of a four-seam fastball is between 85 and 100 mph.

A slider is thrown at roughly 80 to 90 miles an hour which is not much slower than a fastball. When thrown, the slider usually breaks down and away from a right-handed hitter or in other words the slider is low and away from the hitter.

A two-seam fastball has the same arm motion as a four-seam fastball. The difference is the placement of the hand on the ball. In a four-seam fastball the fingers are placed across the seams of the baseball, whereas for a two-seam fastball the fingers are placed along the seams. The alignment of the fingers along the seams creates more movement in the pitch, which is a little slower than the four-seam.

A sinker is a form of the two-seam fastball that is typically thrown at the bottom of the strike zone. To the hitter, the sinker appears to be a strike, but then drops out of the strike zone. The sinker is thrown at about the same speed as a two-seam or four-seam fastball, approximately 80-90 mph.

A cutter (also called a ) is similar to the slider, but doesn’t break down as much as the slider would. This pitch breaks away from the throwing hand of the pitcher.

A change-up is a slow fastball. From the arm motion of the pitcher, there is no difference between the fastball and the change-up. The change-up only has a speed between 70-85 mph.

The curveball has about the same speed as the change-up (70-80 mph) but has lots of vertical movement. The curveball is usually thrown with top spin so the pitch starts high and can then drop into the zone or low out of the zone depending on the spin and release of the pitch.

A knuckle curve is a special case of the curveball. The pitcher tucks in their fingers so that only their knuckles are gripping the ball. This grip allows for more sideways movement than a normal curveball, making this pitch really hard to hit.

The knuckleball has very little or no spin at all and is usually thrown under 80 mph. This pitch is very hard to throw and control, so there are very few pitchers who actually use it.

The splitter pitch has about the same speed as a fastball (80-90 mph) but the movement is much different. The splitter starts off straight, but then breaks down suddenly before reaching the plate. The hitter may think the pitch is a fastball, but then swing over the ball because the ball breaks down at the last second. The splitter is similar to the sinker, but has more of a downward break (Dhakar, “Pitch”).

APPENDIX B: DESCRIPTION OF PITCHF/X VARIABLES FOR PITCH CLASSIFICATION

The variables from PITCHf/x that help to determine the speed and movement are start_speed, break_angle, and break_length. Looking at just the starting speed of the pitch, the break angle, and the break length of the ball, pitches thrown can then be classified. The start speed is the pitch speed in miles per hour at the time that the pitch is released by the pitcher. The break angle is “the angle, in degrees, from vertical to the straight line path from the release point to where the pitch crossed the front of home plate, as seen from the catcher’s/umpire’s perspective” (Fast, “Glossary”). Imagine there is a straight line drawn from the release point of the pitcher to the front of home plate. The break length is measured in inches and is the greatest distance the ball is away from that straight line. John Walsh published an article called “In Search of the Sinker” for The Hardball Times, which gives a great description and picture of the difference between movement and break:

The break of a pitch is the maximum distance between the trajectory of a pitch and a straight line that connects the starting and ending points of the pitch. In the graphic on the right, the break is the length of the red line segment. Movement, on the other hand, is the term used to describe how far a pitch moves compared to a hypothetical pitch thrown without spin. Look at the graphic and imagine that you are seeing an aerial view of a curveball. Had the pitch been thrown without spin it would not have curved, but rather it would have traveled along the blue dotted line. The length of the solid blue line segment is the movement on the pitch. Each pitch has a horizontal and vertical movement, which can be either positive or negative, depending on which way the ball moves. There is only one break, though, and being an absolute distance, is always positive. (Walsh, “In Search”)

The break angle in this graph would be the angle between the black dotted line and blue dotted line.
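The definition of break quoted above can be written as a small computation: given sampled points along a pitch trajectory, the break is the largest perpendicular distance from any point to the straight line joining the first and last points. The sketch below, in Python for illustration, applies this to a made-up two-dimensional trajectory. It is not how PITCHf/x itself computes break_length, and the function name and sample points are hypothetical.

```python
import math


def max_break(trajectory):
    """Largest perpendicular distance from the trajectory to its start-to-end chord.

    `trajectory` is a list of (x, y) points ordered from release to the plate.
    """
    (x0, y0), (x1, y1) = trajectory[0], trajectory[-1]
    chord_length = math.hypot(x1 - x0, y1 - y0)
    if chord_length == 0.0:
        raise ValueError("start and end points must differ")

    greatest = 0.0
    for x, y in trajectory:
        # |cross product| / chord length = perpendicular distance to the chord.
        cross = (x1 - x0) * (y - y0) - (y1 - y0) * (x - x0)
        greatest = max(greatest, abs(cross) / chord_length)
    return greatest


# Made-up aerial view of a curving pitch: x is distance toward the plate (feet),
# y is sideways position (feet).
toy_pitch = [(0.0, 0.0), (15.0, 0.4), (30.0, 0.6), (45.0, 0.4), (55.0, 0.0)]
print(max_break(toy_pitch))
```

Taking the absolute value of the cross product is what makes the break an absolute distance that is never negative, in line with the quoted description.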

APPENDIX C: ABBREVIATIONS USED

H = total hits
BB = total walks
HBP = total hit by pitches
AB = total at-bats
SF = total sacrifice flies
1B = total singles
2B = total doubles
3B = total triples
HR = total home runs
K = total strikeouts
R = total runs
IP = total innings pitched
ER = earned runs
OR = odds ratio
SE = standard error
wOBA = weighted on-base average
GAM = generalized additive model
OBP = on-base percentage
SLG% = slugging percentage
WHIP = walks plus hits per inning pitched
BABIP = batting average on balls in play
LOB% = left on-base percentage
ERA = earned run average
FIP = fielding independent pitching
OPS = on-base plus slugging
BA = batting average
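Several of the rate statistics abbreviated above are simple functions of the counting statistics in the same list. The sketch below, in Python for illustration, uses the standard definitions of BA, OBP, SLG%, OPS, WHIP, and ERA. The function names are hypothetical and the season lines in the example are made up.

```python
def batting_average(h, ab):
    """BA = H / AB."""
    return h / ab


def on_base_percentage(h, bb, hbp, ab, sf):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)


def slugging_percentage(singles, doubles, triples, hr, ab):
    """SLG% = total bases / AB."""
    return (singles + 2 * doubles + 3 * triples + 4 * hr) / ab


def ops(obp, slg):
    """OPS = OBP + SLG%."""
    return obp + slg


def whip(bb, h, ip):
    """WHIP = (BB + H) / IP, with IP given as true innings (thirds as fractions)."""
    return (bb + h) / ip


def era(er, ip):
    """ERA = 9 * ER / IP."""
    return 9 * er / ip


# Made-up hitter: 100 singles, 30 doubles, 5 triples, 15 home runs in 500 at-bats.
hits = 100 + 30 + 5 + 15
obp = on_base_percentage(hits, bb=50, hbp=5, ab=500, sf=5)
slg = slugging_percentage(100, 30, 5, 15, ab=500)
print(round(batting_average(hits, 500), 3), round(obp, 3), round(slg, 3), round(ops(obp, slg), 3))

# Made-up pitcher: 200 innings, 180 hits, 40 walks, 70 earned runs.
print(round(whip(bb=40, h=180, ip=200), 3), round(era(er=70, ip=200), 2))
```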

APPENDIX D: HELPFUL R FUNCTIONS AND LIBRARIES

Throughout the thesis process, many R packages and functions were used to display and analyze the data. Not every package or function used is listed here, only the most important ones.

One important package used was “pitchRx”, which is used to download MLB Gameday data. Without this package, it would be nearly impossible to get the data needed to analyze specific pitches and pitch characteristics such as location, speed, and break length and angle. To pull in the data, the library “dplyr” is also needed. For example, to get the pitching data for the first two games of the 2016 World Series, the packages “dplyr” and “pitchRx” would have to be installed. Within the package “pitchRx” is a function called scrape, which lets the user pick the days for which to pull data. The command would be: scrape(start = "2016-10-25", end = "2016-10-26"). Other functions within “dplyr” were also used. For example, the “select” function allows the user to pick which variables from the pitchRx data to keep, so that no space is wasted on variables that are not needed. “dplyr” was also used to merge datasets with an inner join. “Sample” and “summarize” are further examples of functions used within “dplyr.”

Another important package used was “ggplot2”. This package was the foundation for creating all graphs within R. “ggplot2” is complex, and it can be tricky to get the plot you are seeking, but there are many tutorials and websites that help the user understand the language of ggplot. “mclust” is another library, used to create the pitch classification graphs.

Lastly, the package “mgcv” was used when creating the generalized additive models. This package fits generalized additive models with multiple smoothing parameter estimation.

The main functions used throughout are “table,” “glm,” and “gam.” The “table” function helped create all of the tables presented. It is an easy way to make two-way tables or tables of proportions. The “glm” function was used for the logistic regression models. It fits generalized linear models to the data. Within this function, the user specifies the formula and the error distribution. In our case, the binomial family was used to fit the logistic regression. The “gam” function fits a generalized additive model to the data. Again, the user has to specify the formula and the distribution.