Pitch Type Prediction in Major League Baseball

A thesis presented by

Ryan Joseph Plunkett

to The Department of Statistics and The Department of Computer Science in partial fulfillment of the honors requirement for the joint degree of Bachelor of Arts

Harvard College
Cambridge, Massachusetts

March 29, 2019

Abstract

This thesis was first conceived as an extension of previous research done in the realm of the "pitch prediction" problem, defined here as the process by which individuals competing in Major League Baseball contests decide which of their pitches they will throw in a given situation.

Prior work suggests that binary classification (i.e. predicting whether an upcoming pitch will be a fastball or non-fastball) is possible, and thus, we hope to improve upon these accuracies and extend the research to multi-class classification systems as well.

Since every pitcher is unique, rather than weighting all observations equally when attempting to make predictions for a specific individual, we instead introduce the idea of "similarity analysis" via kernel-weighting mechanisms, in which pitchers deemed comparable via some metric are leveraged more heavily during the training of our localized models. To identify similar pitchers, we represent individuals as three-dimensional clouds of points based on the physical attributes of their pitches, then apply the Earth Mover's Distance algorithm to obtain a measure of distance before using unsupervised learning to cluster pitchers.

Despite our best attempts at classification (both binary and multi-class), we find that our models struggle to surpass the naive baselines set forth in prior research, suggesting that predicting pitch selection may be more difficult than previously reported. The thesis concludes by hypothesizing why our work may have yielded results differing from those of earlier authors while simultaneously examining the possibility of a link between pitcher predictability and performance.

Acknowledgements

I owe immense thanks to a number of people, far too many to name here. I am forever indebted to the Philadelphia Phillies for their generosity in providing the data used for this research; in particular, I would like to thank Matt Klentak, Ned Rice, and Andy Galdi for approving my use of the organization's proprietary information. In addition, I could not have completed this project without the support of my advisors, Lucas Janson and Michael Mitzenmacher, who provided advice, feedback, and encouragement throughout the writing process. I would also like to thank my blockmates and close friends, Andrea Brown, Eric Chin, Katie Cronin, Sofia Kennedy, Raquel Leslie, Nick Nava, Pablo Reimers, and Karen Reyes, as well as my teammates on Harvard Red Line for their friendship and kindness over these past four years. In addition, I would like to thank my girlfriend Nicole Kim, who was always willing to put aside her work and instead listen to me talk about mine. Finally, and most importantly, I owe tremendous thanks to my parents, Joanne and Timothy Plunkett, for all of their love. I do not know where I would be without them, and this thesis is as much a product of their hard work as it is mine.

Contents

1 Introduction
2 Literature Review
  2.1 Early Work and Motivation
  2.2 Attempts at Binary Classification
  2.3 Expansion to Multi-Class Predictive Models
3 Data
  3.1 Provided Features
  3.2 Engineered Features
  3.3 Summary Statistics
  3.4 Train-Test Split
4 Methodology
  4.1 Earth Mover's Distance
  4.2 Kernel-Weighting Methods
  4.3 Clustering
  4.4 Replicating Previous Work
  4.5 Models
    4.5.1 Linear Discriminant Analysis
    4.5.2 Support Vector Machines
    4.5.3 Classification Trees
    4.5.4 Random Forests
    4.5.5 Boosted Trees
5 Results
  5.1 Multi-Class Classification
  5.2 Binary Classification
    5.2.1 Unregularized Logistic Regression
    5.2.2 L1-Regularized Logistic Regression
    5.2.3 Varying Coefficients Logistic Regression
    5.2.4 Linear Discriminant Analysis
    5.2.5 Support Vector Machines
    5.2.6 Random Forests
    5.2.7 Boosted Trees
  5.3 Other Forms of Binary Classification
  5.4 Relating Predictability to Performance
6 Conclusion
  6.1 Overview of the Research
  6.2 Implications of the Results
  6.3 Final Thoughts
Bibliography

List of Figures

1 Histogram representing the count of each pitch type found in the cleaned data
2 Histogram representing the number of pitches thrown by individual pitchers in the 2016-2017 MLB seasons
3 Two one-dimensional distributions plotted along the axes resulting in a joint distribution and transport plan that can be optimized via the EMD algorithm
4 Bipartite flow graph representing an instance of the transportation problem with three suppliers and two consumers
5 Red Sox pitcher David Price's distribution of pitches as a cloud in three-dimensional space
6 Scatter plot comparing the test performance of unregularized logistic regression to naive baseline predictions
7 Scatter plot representing the correlation between predictions made by the global and local regression models
8 Scatter plot relating an individual pitcher's test accuracy using LDA to the number of observations thrown by him in the training data
9 Histogram representing the improvement of individually trained and cross-validated random forest models over each pitcher's naive baseline
10 Scatter plot displaying the relationship between pitcher predictability and Earned Run Average
11 Scatter plot displaying the relationship between pitcher predictability and Fielding Independent Pitching
12 Scatter plot displaying the relationship between pitcher predictability and Expected Fielding Independent Pitching

List of Tables

1 Covariates reported by Guttag and Ganeshapillai in their research entitled "Predicting the Next Pitch"
2 Average normalized coefficients corresponding to the linear support vector classifiers trained by Guttag and Ganeshapillai
3 Pitchers with the highest naive test accuracies
4 Pitchers with the lowest naive test accuracies
5 Pitchers with the highest test accuracies using unregularized logistic regression with full feature set
6 Pitchers with the lowest test accuracies using unregularized logistic regression with full feature set
7 Performances of different models when predicting fastballs vs. non-fastballs
8 Performances of different models when predicting breaking balls and offspeed pitches
9 Test accuracies of binary predictions made when considering only the ball-strike count

1 Introduction

Beginning with the publication of Michael Lewis' Moneyball in 2003, the quantitative analysis of baseball has shifted to the forefront of the public eye, with teams and fans alike attempting to better understand America's most beloved game [11]. As technology has improved, so too has the quality of analysis: what first began as crude calculations done to quantify the relative value of a walk compared to a base hit has now evolved into sophisticated research leveraging measurable physical attributes captured in the midst of games, like how a fielder's reaction time affects his ability to make a catch or how a baserunner's maximum speed as he rounds the bases impacts his chances of scoring a run.

Despite this wave of more complex analysis sweeping across the landscape of Major League Baseball (MLB), the all-important pitcher-batter interaction remains at the heart of the sport. A common thread connects the innings, games, and seasons that accumulate across the league: no discrete event can begin before a pitch is thrown. Until a pitcher glances in at the catcher for the sign, comes set, and kicks his leg before stretching towards the plate, all research, analysis, and projections are rendered useless... yet the factors that motivate individual pitchers' decisions when determining what to throw next remain relatively unexplored in public research. Rather than relying upon conventional wisdom, which suggests that pitchers rely upon their fastballs to "set up" the remainder of their arsenals, or blindly following a pitcher's past tendencies, we instead hope to utilize machine learning to better understand how pitchers behave: namely, it is natural to question whether we might be able to accurately predict what a pitcher is likely to throw given the growing amount of information at our disposal and the development of new statistical techniques.

Such a revelation would be a rather substantial development, given that the limited work previously done on the subject has been rather inconclusive in nature. Starting at a young age, pitchers are taught to try to be unpredictable - if a batter knows what an individual is likely to throw in a given situation, he gains an inherent advantage. While obvious sequences and patterns may not be detected by the human brain, machines may be able to identify trends in pitch selection. As detailed pitch-by-pitch data becomes more widely available, it seems as if researchers are now equipped to answer the question of pitch type prediction better than ever before, and while the topic has been broached in some previous literature, we hope to provide a deeper, more novel dive into the pitch type prediction problem, leveraging earlier findings as well as our own intuition to motivate new approaches to the issue.

Prior research has indicated that batter performance tends to improve later in games as hitters face a given pitcher multiple times in the same outing, while other groups have simultaneously concluded that it may be possible to accurately predict whether a pitcher will throw a fastball or a non-fastball in a specific situation. With this in mind, it seems reasonable to wonder if these two results could be inherently linked - batters may perform better after facing a pitcher earlier in games because they have learned something about his behavior during previous plate appearances, ultimately gaining the ability to predict what pitch they will see. Our research focuses on this question as the driving force behind our work - while we cannot conclude with certainty that our models accurately approximate the workings of a hitter's brain in determining what we believe a pitcher is likely to throw, if machine learning methods can identify signals within pitch sequencing, then perhaps this knowledge can be applied to assist batters as they prepare for crucial plate appearances, or even be cross-referenced by pitchers as they devise their own strategies for upcoming contests. Given that all events occurring on a baseball field begin with a pitch, any semblance of an ability to provide accurate predictions about what will be thrown would represent a massive competitive advantage that could drastically alter the landscape of the sport.

In summary, this thesis will address what we deem to be the "pitch prediction" problem, attempting to generate accurate predictions of what an individual will throw in a given situation based on contextual information and the pitcher's past tendencies. In order to develop our own approach to the subject, previous research in this sphere will be summarized and drawn upon, as the strengths and weaknesses of past work will be used to shape our own. A brief description of the data used in this research follows, in which the provided variables are documented and then used to engineer other features that may prove helpful as covariates. After motivating the problem and outlining the information available for use in attempts to solve it, the idea of similarity analysis is introduced - namely, we explore the possibility that individuals whose pitches move similarly may also behave similarly, and thus, the idea of computing some measure of "distance" between pitchers and subsequently using said metric as part of a kernel-weighting method moves to the forefront. Finally, a series of predictive models are fit to past seasons of training data and then evaluated during the 2018 regular season, with extra care taken to not only identify the most and least predictable pitchers, but to also explore the possibility that our index is somehow related to pitcher performance. The analysis concludes with an overview of our results and their implications as they pertain to Major League Baseball, where real-time predictions of pitch sequencing could impact the fan experience as well as the on-field product.

2 Literature Review

2.1 Early Work and Motivation

It is unclear exactly when researchers first began to consider the possibility of predicting what type of pitch an individual may throw to a given batter in a particular situation, though the idea was perhaps motivated by the discovery of the "times through the order" (TTO) penalty, first proposed by Smith in 1996 [15]. In writing an article entitled "Do Batters Learn During a Game?," he found that if a single batter were to face the same pitcher multiple times during the same game, the batter's individual performance would be expected to improve with each subsequent plate appearance. Though Smith did not have access to the pitch-by-pitch data that is now becoming more freely available thanks to the public propagation of baseball research, his results were rather conclusive on an at-bat level: after examining more than 1.6 million plate appearances between 1984 and 1995, he found that batting average (6% improvement), on-base percentage (3% improvement), and slugging percentage (8% improvement) all significantly increase between a batter's first and fourth plate appearance against a specific pitcher within a given game [15]. Though Smith acknowledges that batters may not necessarily be learning, as their improved performance could perhaps be explained by any combination of increasing pitcher fatigue, a higher degree of comfort achieved as the ballpark and environment become more familiar while the game progresses, or even better vision as stadium lights begin to turn on, the mere possibility of human batters learning pitcher sequencing tactics throughout the course of a baseball game would seem to suggest that the topic merits further exploration, including the leveraging of machine learning and other more robust analysis to uncover potential trends.

In order to predict any patterns in pitch sequencing, however, one must be able to accurately identify what type of pitch is actually being thrown, an already difficult task when relying strictly upon visual data that would be made only harder by the labor required to manually tag the video. Luckily, as outlined in Fast's 2010 article "What the Heck is PITCHf/x?," those researchers examining pitch-level metadata no longer have to concern themselves with this tedious process [6]. Rather than predicting what type of pitch will be thrown, PITCHf/x instead classifies the type of the pitch that was just observed, using physical attributes of the pitch's velocity and trajectory as features for classification. Considering that the cameras needed to track such measurements were first installed only in 2006, and the PITCHf/x system was not found in all MLB stadiums until 2008, the tools needed to fundamentally model pitch type selection have only just recently come into the public eye, and thus it seems as if there may still be room for growth when exploring the problem [6]. While PITCHf/x drove innovation, MLB no longer employs the system for pitch tagging, instead turning to Major League Baseball Advanced Media's (MLBAM) own Statcast software, which was installed in all parks during the 2015 regular season. Though their models are proprietary, MLBAM has announced that their pitch classifications are made using a neural net algorithm taking into account pitch speed, spin rate, and movement, among other features; while our project will not consider these predictors, given that the models we produce will be trained on these classification target labels, it still seems rather prudent to understand how they are generated.

2.2 Attempts at Binary Classification

Though the aforementioned papers were instrumental in first identifying the possibility that batters could learn during a game, in addition to actually providing analysts with the tools necessary to explore this thought, it would seem that the earliest research directly confronting the pitch type prediction question was published by Guttag and Ganeshapillai at the Massachusetts Institute of Technology in 2012 [8]. Rather than attempting to perform multi-class classification as we hope to do, their research, entitled "Predicting the Next Pitch," instead focused on binary predictions, in which machine learning was employed to determine whether the upcoming pitch was more likely to be a fastball or non-fastball. Despite the lack of granularity in their produced response variable, given the obvious similarities between their analysis and our own, a fundamental understanding of their approach would seem to be appropriate.

Guttag and Ganeshapillai had access to pitch-level information from the 2008 and 2009 MLB regular seasons; in order to respect the temporal nature of their data, they chose to train their models using only 2008, then evaluate their test accuracy with the 2009 data, including only those 359 pitchers who had thrown at least 300 pitches in each season. While the researchers do not identify why they selected support vector machines as their models of choice, they do report employing support vectors with linear kernels to produce predictions; while they note that higher accuracy likely could have been achieved using the "kernel trick" of mapping their data to higher dimensions, Guttag and Ganeshapillai instead chose to limit model complexity in exchange for better interpretability, using only linear kernels so they could interpret feature coefficients as proxies for their importance [8]. Because the data were not linearly separable (in large part because of the randomness associated with pitch type selection), soft-margin support vectors were used, and after tuning the hyperparameters of their models, Guttag and Ganeshapillai reported correctly predicting whether an upcoming pitch would be a fastball or non-fastball 70% of the time using unseen test data. This represented a mean improvement of 18% compared to their defined baseline model, which would naively predict an individual's most commonly thrown pitch type in the training set for every observation in the testing data. But rather than focusing on their reported accuracy, it is perhaps more interesting to instead emphasize the features found to be most predictive in their work.

Somewhat intuitively, Guttag and Ganeshapillai reported that pitcher-batter priors were the most significant covariates in their models - for each pitch, the researchers added columns to the data representing how often the pitcher in question threw each of his pitches against that same batter in previous meetings [8]. While deemed to be significant, it is worth noting that such a predictor is rather unstable, as values were often missing (i.e. unique pitcher-batter pairings happen quite frequently given MLB player turnover and imbalanced schedules), and even if the values were present, they could be quite volatile (i.e. the sample from which conclusions can be drawn is expected to be quite small). To compensate for the instability in this predictor while simultaneously respecting its importance at the individual pitcher level, we propose the idea of identifying "similar" pitchers along some axes, then using pitches thrown by these individuals to inform the prediction process as necessary, with a methodology that will be described in further detail below. In addition to identifying their pitcher-batter prior feature as a significant predictor, Guttag and Ganeshapillai also reported the following covariates, listed in decreasing order of significance: shrunk pitcher-batter prior (i.e. regressing pitchers' usage rates against the batter in question towards their global average usage rates), pitcher-count prior (i.e. pitcher's past usage rates in identical ball-strike counts), shrunk pitcher-count prior (i.e. regressing pitcher-count prior in a similar fashion to pitcher-batter prior), previous pitch velocity, velocity gradient of the three previous pitches, previous pitch type, inning, number of outs, score differential, and baserunning state (all features were included as they were immediately prior to the delivery of the pitch in question). Given the success reported by Guttag and Ganeshapillai, in addition to their principled feature engineering and modeling processes, it would seem rather prudent to include these covariates in our initial attempts to solve a strikingly similar problem as well.

After publishing their work in 2012, Guttag and Ganeshapillai were cited in "Supervised Learning in Baseball Pitch Prediction and Hepatitis C Diagnosis," published by Hoang at North Carolina State University in 2015 [10]. Like the seminal work, Hoang used the 2008 and 2009 seasons for training and testing, respectively, though the requisite number of pitches thrown in each season was raised to 750 (compared to only 300 in the previous study). Rather than fitting only a single machine learning method, Hoang instead implemented a k-nearest neighbors (k-NN) model, experimented with linear discriminant analysis (LDA), and tried using support vector machines with both linear and Gaussian kernels before concluding that LDA achieved the best performance with 78% accuracy, representing an 8 percentage point improvement over Guttag and Ganeshapillai and a 21% increase when compared to the previously described naive model [10]. To achieve this performance, Hoang claims to have used adaptive feature selection (i.e. fitting a different model for each pitcher using his own optimal feature set), and while this does not necessarily align with our own research, it does beg the question of whether a single overarching model can be fit to all pitchers that uses their identities as features, or if instead it may be more effective to train a separate model for each pitcher to better approximate their individual decision-making processes.

Like Hoang, Beaver et al. were also inspired by the work completed by Guttag and Ganeshapillai and tried to build upon their research in the binary pitch prediction sphere, publishing their own website as part of a capstone project at the University of California-Berkeley in 2015 [3]. Though details of their project were rather scarce, Beaver et al. outlined a similar feature set to those that were described in previous research, while also suggesting that they trained a meta-ensemble "stacking" method to generate their predictions. Given the obvious complexity of the problem in question, such an approach would seem to be reasonable: by aggregating machine learning methods and thus developing more sophisticated prediction functions, superior accuracy could perhaps be achieved. Though the explanation of their methodology was somewhat cloudy, it seems as if the added layer of model architecture provided by Beaver et al. did in fact prove helpful, potentially justifying the use of similar ensemble learning techniques in our own research.

2.3 Expansion to Multi-Class Predictive Models

While binary pitch type predictions are intriguing, they offer little in the way of applicability when considered in the context of an actual baseball game, as a "non-fastball" label can represent a variety of pitch types, each of which presents batters with different challenges and could induce unique strategies. Thus, more granular labels may be desired, and the first publicly documented attempt to do so was Bock's 2015 "Pitch Sequence Complexity and Long-Term Pitcher Performance," in which multinomial logistic regression and support vector machines were fit to pitch-by-pitch data from the 2011, 2012, and 2013 regular seasons before being evaluated during the 2013 postseason [4]. Bock found that support vectors with non-linear kernels provided the best accuracy; however, his work is rather surprising, as the reported 75% out-of-sample accuracy compares quite favorably to other previously published binary predictions. Rather than emphasizing the success he claims, it is perhaps more informative to examine the latter part of his project, in which he compares a pitcher's predictability according to his models to the pitcher's actual performance, assuming that individuals who sequence their pitches in a systematic manner may not experience the same success as their more unpredictable counterparts, simply because they lack the element of surprise. Here, Bock's research invites further investigation, as he does claim that predictability indices are positively correlated with Earned Run Average (ERA) and Fielding Independent Pitching (FIP), two metrics commonly used to evaluate pitcher performance and compare their results to their peers [4]. In addition to simply generating predictions for individual pitches, taking a similar approach in analysis here may prove prudent - in the aggregate, it is unclear if more predictable pitchers are actually worse. Based on Bock's findings, he would conclude that such a hypothesis is true, though it may be worth verifying for ourselves as well.

While Sidle and Tran did not necessarily attempt to answer how predictability and performance are correlated, they did provide a more comprehensive study of the use of k-class classification in their 2015 publication "Using Multi-Class Classification Methods to Predict Baseball Pitch Types" as part of a graduate program at North Carolina State University [14]. Despite the increased complexity of the response variable, these researchers were able to achieve 67% out-of-sample prediction accuracy, well below the 75% reported by Bock though perhaps a more believable result given the nature of the problem. After restricting their data set to the 287 unique pitchers that threw at least 500 pitches in both the 2014 and 2015 regular seasons, Sidle and Tran trained models using pitches thrown in 2013, 2014, and the first half of 2015, then evaluated their performance using only the second half of the 2015 regular season. Like Hoang, Sidle and Tran implemented a variety of models during the development process, finding that random forests (67% test accuracy) slightly outperformed both LDA (66% accuracy) and linear support vectors (64% accuracy). Somewhat surprisingly, Sidle and Tran appear to contradict previous research, claiming that an individual's pitch count (rather than pitcher-batter prior) is found to be the most significant predictor included in their machine learning method, a rather counterintuitive result [14]. The authors also hand-pick a few individual pitchers for analysis and identify no clear relationship between their predictability and performance, though this section of the report was incomplete, and in conjunction with the contrary argument posed by Bock, our own work shall attempt to address the inherent disconnect.

Though multi-class classification has been previously attempted in this sphere, it seems as if this research has not yet been replicated, and this thesis will attempt to verify these results while simultaneously building upon the work that has already been done to better understand the pitch type prediction problem. Most notably, our work attempts to explore the use of similarity analysis when producing predictions - as Guttag and Ganeshapillai noted in their seminal "Predicting the Next Pitch," prior pitches thrown in previous match-ups of a unique pitcher-batter pairing can be useful when predicting upcoming ones, but such features can be sparse and misleading because of the limited (or even nonexistent) sample size of such pitches from which to draw. In an attempt to engineer a more stable feature for all plate appearances, perhaps a measure of distance between pitchers could be computed, such that pitches thrown by similar individuals could be used to inform our predictions of what will be thrown in more novel match-ups. Our approach to computing these distances, in addition to appropriately weighting observations based on their similarity scores, can be found below.

3 Data

This data was generously provided by the Philadelphia Phillies, with whom I interned as a Research and Development Summer Associate for the past two seasons and whose offices I will be joining following graduation. Despite my prior relationship with the organization, I declare no conflict of interest in my work, and all thoughts and opinions expressed herein are solely those of the author and do not reflect the beliefs of the Philadelphia Phillies. Without their permission to use the proprietary Statcast data that permeates this research, such a thesis would not have been possible, and thus, I am forever grateful for their kindness. While the Phillies did agree to allow me to use their data, they requested that the files remain confidential, in addition to the code used to produce my results, and thus, my ability to provide further detail about my work outside the scope of this thesis is unfortunately limited. However, I invite the questions and comments of anyone interested in my work, and I will attempt to answer all inquiries if possible given the parameters of my agreement with the organization.

The data originated in an easily manageable format, with every observation representing a single pitch thrown across Major League Baseball between the start of 2016 and the end of the 2018 regular season, excluding all spring training, exhibition, and postseason games. While Statcast operating systems were first installed in ballparks starting during the 2015 season, many of the units were not properly functioning or calibrated until the beginning of 2016, and thus, in the interest of maintaining a high standard of data quality for this project, the "experimental" observations collected during 2015 were ignored for the purposes of this thesis.

While additional features were later generated in order to assist with modeling individual pitcher tendencies, it is worth noting that the data provided by the Phillies actually contained quite a few covariates of its own; short descriptions of these features and their meaning can be found in the following section.

3.1 Provided Features

The features below were provided in the dataset donated by the Philadelphia Phillies as part of their contribution to this research:

• season - the season in which a pitch was thrown

• play id - a unique alphanumeric string used to encode individual plays

• game id - a unique alphanumeric string encoding individual games played

• pitch number - integer representing how many pitches have been thrown in a given at-bat

• inning - integer representing which inning of the game a pitch was thrown during

• half inning - indicator variable representing whether a pitch was thrown in the top or bottom half of an inning

• pitcher id - unique integer string representing the pitcher's identity

• pitcher team id - unique integer string representing the pitcher's team

• batter id - unique integer string representing the batter's identity

• batter team id - unique integer string representing the batter's team

• bats - character string indicating whether the batter hit right-handed or left-handed during the plate appearance in question

• throws - character string indicating whether the pitcher threw right-handed or left-handed for the plate appearance in question

• batting order - integer 1-9 identifying which spot in the batting order was hitting during a given plate appearance

• bvp (batter vs. pitcher) - integer counting the number of times the current pitcher has faced the current hitter during the game

• pitch count - integer representing the number of pitches thrown by a pitcher in the current game

• balls - integer 0-3 recording the number of balls in the count prior to the delivery of the pitch

• strikes - integer 0-2 recording the number of strikes in the count prior to the delivery of the pitch

• outs - integer 0-2 recording the number of outs in the inning prior to the delivery of the pitch

• away score - integer counting the number of runs scored by the away team before the pitch was thrown

• home score - integer counting the number of runs scored by the home team before the pitch was thrown

• pitch type - pitch type classification produced using an underlying and proprietary neural net algorithm (response variable)

• pitch result detail - character string identifying the result of the thrown pitch (e.g. swinging strike/called strike/ball/foul/hit/etc.)

• location x - floating point value used to measure the horizontal location of a thrown pitch in relation to the center of home plate

• location z - floating point value used to measure the vertical location of a thrown pitch in relation to ground level at home plate

After cleaning the queried data and dropping missing values originating from flaws in the Statcast operating software, we were left with just over 2.2 million pitches thrown across Major League Baseball during the 2016, 2017, and 2018 regular seasons.

3.2 Engineered Features

Having cleaned the provided data, before actually generating predictions, it seemed rather prudent to engineer some features of our own, largely using domain knowledge to inform our choices. While the original data did include baserunning state, the information was not encoded in a numerical format, and thus, binary indicator variables were created to identify which bases were occupied by runners at the time of a pitch, as pitchers often experience diminished performance when challenged with runners in scoring position, perhaps suggesting that their sequences subtly change as well. In addition, it also seems plausible to suggest that pitchers may behave differently depending on the score of the game - they may attempt to be more unpredictable in high-leverage situations arising in the late innings of close games, while in blowouts when the outcome has largely been decided, pitchers may be content to simply "pound the strike zone" with fastballs in an attempt to end games more quickly at the expense of a few runs. As such, the score differential between teams was computed at the time of every pitch in the available data.

In addition to engineering features measuring aspects of the game state at the time of an individual observation, because pitches are thrown in sequence, it is also worth considering covariates relating to the context of the plate appearance during which a pitch is thrown. For instance, if a pitcher just threw a fastball for a swinging strike, he might throw yet another fastball in pursuit of a similarly positive result, or if a batter just fouled off a pitcher's change-up, he may become more likely to follow that up with a curveball in the dirt to keep the hitter off-balance. Without considering previous pitches thrown in the sequence of an at-bat, our analysis could potentially leave out critical pieces of information that shape an individual's choice when determining his next pitch. To account for the influence previous pitches in the same at-bat may have on subsequent ones, the pitch type, location (i.e. quadrant in which the ball crossed the two-dimensional plane perpendicular to home plate), and result (called strike, swinging strike, foul, or ball) of the three pitches immediately prior (if applicable) were one-hot encoded and appended to the dataframe, such that a number of additional features were engineered for every observation. Here, by conditioning on earlier pitches and utilizing this data to inform future predictions, the observations in question do not occur in a vacuum, but rather in the context of the at-bat in which they were thrown, perhaps more accurately approximating the information processed by pitchers when considering what pitch will be most effective in a particular situation, emphasizing the weight of the recent past in their memories.
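Because the code written for this thesis must remain confidential, the sketch below is only an illustration of the feature engineering just described, not the actual implementation. It assumes a data frame pitches with one row per pitch and hypothetical column names (pa_id identifying a plate appearance; runner_on_first and similar flags for the baserunning state):

```r
# Illustrative sketch of the engineered features (hypothetical column names)
library(dplyr)

pitches_fe <- pitches %>%
  mutate(
    # Score differential from the pitcher's perspective: the home team
    # pitches in the top half of an inning
    score_diff = ifelse(half_inning == "top",
                        home_score - away_score,
                        away_score - home_score),
    # Binary indicators for the baserunning state
    on_first  = as.integer(runner_on_first),
    on_second = as.integer(runner_on_second),
    on_third  = as.integer(runner_on_third)
  ) %>%
  group_by(pa_id) %>%
  arrange(pitch_number, .by_group = TRUE) %>%
  mutate(
    # Type and result of the pitches immediately prior in the same at-bat;
    # the first pitches of an at-bat carry NA lags
    prev_type_1   = lag(pitch_type, 1),
    prev_type_2   = lag(pitch_type, 2),
    prev_type_3   = lag(pitch_type, 3),
    prev_result_1 = lag(pitch_result_detail, 1)
  ) %>%
  ungroup()

# One-hot encode the lagged categorical context before modeling
X <- model.matrix(~ prev_type_1 + prev_result_1 - 1, data = pitches_fe)
```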

3.3 Summary Statistics

When dealing with classification problems, it can be informative to examine the distribution of true labels, to identify potential instances of class imbalance and simultaneously inform the development of models. Here, because the majority of professional pitchers rely on their fastballs (as seen in Figure 1), it would seem that four-seam fastballs (represented in the data as FF) and two-seam fastballs (represented as FT) should dominate our predictions, while secondary pitches like split-fingered fastballs (SP), knuckleballs (KN), and slow curves (SC) may be more difficult to identify because of their rarity. While class imbalance arises in a number of disciplines and is dealt with accordingly, given that the overwhelming majority of pitches thrown in the training sample are some sort of fastball, it seems likely that such classifications will dominate our test predictions as well.

Figure 1: Histogram representing the count of each pitch type found in the cleaned data

In addition to exploring the labels assigned to observations, it is also worth considering how many observations can be attributed to each of the n = 425 pitchers that were included in this project. Figure 2 presents a histogram that does just this; we find that the average number of pitches thrown in the training data per individual is 2861, though this total is somewhat skewed by outliers, as the median observation count is only 2231. As seen in the histogram, there is quite a wide range of training observations: Brandon Workman threw only 619 pitches during the two seasons in question because of injury, while renowned workhorse Justin Verlander threw 7324 pitches over that time span. Given the disparity in training observations seen, a natural question is whether there will be some correlation between one's number of appearances in the data and his ultimate predictability in the test set.

Figure 2: Histogram representing the number of pitches thrown by individual pitchers in the 2016-2017 MLB seasons
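As a brief illustration (again, not the confidential thesis code), the summary statistics above could be reproduced along the following lines, assuming the cleaned data frame pitches from earlier in this chapter:

```r
# Illustrative sketch of the summary statistics reported above
train_pitches <- subset(pitches, season %in% c(2016, 2017))

counts <- table(train_pitches$pitcher_id)  # pitches per pitcher
mean(counts)    # reported average: 2861
median(counts)  # reported median: 2231

# Distribution of response labels, to gauge class imbalance (Figure 1)
sort(table(train_pitches$pitch_type), decreasing = TRUE)
```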

3.4 Train-Test Split

Before actually fitting a prediction function to the available data, it was first necessary to identify a training and testing set for analysis. In most instances, a simple random split of the data works well enough for this purpose; however, because of the temporal structure of our problem, randomly sampling obser- vations for training may not be the best strategy. In other words, including pitches thrown in 2018 as part of our training set, only to then evaluate the accuracy of our machine learning method by applying to it to observations seen during the 2016 regular season could represent a form of data leakage. For in- stance, imagine a scenario in which a pitcher grows more comfortable with one of his offspeed pitches in the off-season, and thus, he begins to throw it more frequently next year. Observing this change in his sequencing may affect our

18 expectations, yet during the current season, this should not impact our predic- tions, as at the time, we had no evidence of any changes. If our goal is to better estimate what a pitcher will choose to throw based on what we know when he comes set, we should strictly limit ourselves to information available at that moment. Thus, to account for this added element of time, and given that the provided set of observations included all pitches thrown during the 2016, 2017, and 2018 regular seasons, we limited the scope of our training set to only the

first two seasons, treating the final one as a test set upon which our models’ performances were evaluated.
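A minimal sketch of this temporal split, assuming the cleaned data frame pitches with its season column:

```r
# Temporal train-test split: train on 2016-2017, hold out 2018
train <- subset(pitches, season %in% c(2016, 2017))
test  <- subset(pitches, season == 2018)

# Guard against leakage: no 2018 observations may appear in training
stopifnot(all(train$season < 2018))
```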

4 Methodology

4.1 Earth Mover’s Distance

As previously mentioned, by invoking similarity analysis to overcome small sample sizes when considering previous plate appearances involving a specific pitcher, we hope to unlock illuminating information about how the individual in question may choose to behave. Thus, such an approach would seem to be rather reasonable; however, the question of actually measuring "similarity" between pitchers remains. After consulting a number of sources, it was decided that the Earth Mover's Distance (EMD) algorithm could prove useful here, and a brief summary of the computation follows.

The EMD algorithm was first proposed by Rubner et al. at Stanford, in their paper “A Metric for Distributions with Applications to Image Databases,” published in 1998 [13]. They posit that when presented with two probability distributions as visualized in Figure 3, one can be thought of as a mass of earth spread throughout the space, while the second is considered to be a series of holes in the same space. Naturally, one may be inclined to measure the minimum amount of work needed to fill the holes using the provided earth, where “work” can be loosely defined as the distance between a bit of earth and the hole that it is eventually used to fill.

Figure 3: Two one-dimensional distributions plotted along the axes resulting in a joint distribution and transport plan that can be optimized via the EMD algorithm

Figure 4: Bipartite flow graph representing an instance of the transportation problem with three suppliers and two consumers

The EMD algorithm cleverly solves this problem based on a solution to the transportation problem, which must also be defined: here, the transportation problem represents a bipartite flow (as seen in Figure 4) that can be perceived as an example of linear programming [13]. Let $I$ represent a set of suppliers, $J$ represent a set of consumers, and $c_{i,j}$ account for the cost required to transport material between elements $i \in I$ and $j \in J$. Then, our goal is to find the set of flows $f_{i,j}$ minimizing the total cost of transportation, which can otherwise be formalized as:

\[ f^{*}_{i,j} = \arg\min_{f} \sum_{i \in I} \sum_{j \in J} c_{i,j} f_{i,j}, \]

where the minimization is subject to the following constraints:

\[ f_{i,j} \geq 0, \quad i \in I,\ j \in J \]
\[ \sum_{i \in I} f_{i,j} = y_j, \quad j \in J \]
\[ \sum_{j \in J} f_{i,j} \leq x_i, \quad i \in I, \]

where $x_i$ is the total supply of supplier $i$ and $y_j$ is the total demand of consumer $j$. Then, as long as total demand does not exceed total supply, or:

\[ \sum_{j \in J} y_j \leq \sum_{i \in I} x_i, \]

computing the difference between probability distributions can be thought of as a form of the transportation problem (where this final constraint is always satisfied, as distributions must sum to 1 by definition), where the costs $c_{i,j}$ represent the ground distance between elements $i$ and $j$. Then, the Earth Mover's Distance between distributions $x$ and $y$ can be computed as:

\[ \mathrm{EMD}(x, y) = \frac{\sum_{i \in I} \sum_{j \in J} c_{i,j} f_{i,j}}{\sum_{i \in I} \sum_{j \in J} f_{i,j}} = \frac{\sum_{i \in I} \sum_{j \in J} c_{i,j} f_{i,j}}{\sum_{j \in J} y_j}. \]

While the EMD algorithm was introduced in the context of probability distributions, it can also be used for our purposes, to measure the "distance" between pitchers to identify comparable individuals, provided that these pitchers can be defined along some arbitrary number of axes [18]. Rather than comparing pitchers, perhaps we can instead compare the physical attributes of their pitches themselves. By analyzing pitches in three-dimensional space (release velocity, horizontal movement, and vertical movement), we can both easily visualize pitchers as clusters of points while simultaneously considering these clouds as probability mass functions that can be inputted to the Earth Mover's Distance algorithm for comparison.

Figure 5: Red Sox pitcher David Price's distribution of pitches as a cloud in three-dimensional space
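To make the optimization concrete, the sketch below solves a tiny instance of the transportation problem with the lpSolve package and normalizes the resulting cost into an EMD; the supplies, demands, and cost matrix are purely illustrative:

```r
# Illustrative transportation problem underlying the EMD
library(lpSolve)

supply <- c(0.5, 0.3, 0.2)  # three "suppliers" (mass of earth)
demand <- c(0.6, 0.4)       # two "consumers" (holes to fill)
costs  <- matrix(c(1, 4,
                   2, 1,
                   3, 2), nrow = 3, byrow = TRUE)  # ground distances c_{i,j}

fit <- lp.transport(costs, "min",
                    row.signs = rep("<=", 3), row.rhs = supply,
                    col.signs = rep("==", 2), col.rhs = demand)

flows <- fit$solution                     # optimal flows f_{i,j}
emd   <- sum(costs * flows) / sum(flows)  # EMD = total work / total flow
```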

Here, we make a rather crucial assumption: namely, we suggest that individuals whose pitches move similarly may also tend to sequence them in a similar manner. While such a belief may not necessarily be true, it does seem plausible - a repertoire of pitches that behave similarly may experience the most success if also sequenced in the same order (i.e. sweeping breaking pitches may be most effective when immediately following high fastballs, throwing multiple change-ups with outstanding arm-side movement may disrupt a batter's timing, etc.). While the EMD algorithm is known for being relatively robust even when generalized to high dimensions, prior research has indicated that the metric is subject to distortion as the feature space expands, and thus, it may be beneficial to limit the number of axes across which pitchers are compared [2]. Thus, for the purposes of this study, we limit our scope to only the three previously described dimensions (release speed, horizontal movement, and vertical movement), using the Earth Mover's Distance algorithm to compute the pairwise distances between a given pitcher and each of his peers remaining in the dataset.

However, because of the computationally intensive nature of the EMD, rather than computing the true distances (i.e. using every pitch thrown by an individual as an observation), we instead decided to cluster similar pitches along these axes, then assign weights to these clusters based on how many pitches were most closely associated with each cohort. In order to accomplish this, a three-dimensional grid was established using arbitrary intervals along each axis (5 mph for release velocity, 5 inches for both horizontal and vertical movement), such that every pitcher could be represented as a weighted mesh of 392 clusters. In so doing, the computational complexity of the Earth Mover's Distance algorithm was lessened while simultaneously normalizing the number of pitches thrown by every individual in the data, as pitchers who had appeared with varying frequencies could now be compared across an identical set of mesh points. By employing these distances, we were able to obtain a metric approximating pitcher similarity; ideally, by weighting those observations thrown by similar pitchers, our predictions may perhaps be more accurate than those made when treating all pitchers equally. Now, we must develop a systematic approach to determine appropriate weights based on proximity.
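A sketch of the mesh construction and pairwise distance step follows, assuming a training frame with hypothetical column names release_speed, h_movement, and v_movement; it uses the emdist package, one R implementation of Rubner et al.'s algorithm whose emd() accepts weighted point sets (mass in the first column, coordinates in the rest):

```r
# Illustrative mesh construction and pairwise EMD between two pitchers
library(emdist)

# Snap each pitch to a 3D grid cell (5 mph x 5 in x 5 in intervals)
bin <- function(x, width) width * floor(x / width)
cells <- with(train, paste(bin(release_speed, 5),
                           bin(h_movement, 5),
                           bin(v_movement, 5)))

# Represent one pitcher as a weighted mesh: normalized cell weights
# alongside the cell coordinates
pitcher_mesh <- function(id) {
  tab <- table(cells[train$pitcher_id == id])
  coords <- do.call(rbind, lapply(strsplit(names(tab), " "), as.numeric))
  cbind(as.numeric(tab) / sum(tab), coords)  # first column = mass
}

ids <- unique(train$pitcher_id)
d12 <- emd(pitcher_mesh(ids[1]), pitcher_mesh(ids[2]))
```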

4.2 Kernel-Weighting Methods

Luckily, rather than devising our own weighting mechanisms for similar observations from scratch, our research can rely upon work previously done in this realm - namely, kernel smoothing methods, or perhaps more generally, localized classification. Like other classification procedures, kernel-weighting mechanisms assume that observations most similar in the feature space are correspondingly similar in the output space, and thus, they explicitly leverage training data defined as "nearby" when making predictions on unseen data [9]. Unfortunately, because this is a memory-based procedure (one which requires not only the storage of the entire training set but also access to the data prior to every evaluation made on the test set), such kernel methods can be computationally challenging to implement.

However, given that we had already computed the Earth Mover's Distance between different pitchers, which can be thought of as a metric approximating the similarity between pairings, a kernel-based approach seemed rather natural. Rather than apply global logistic regression, in which every training observation is weighted equally and a set of parameters is fit to the data prior to evaluating the model on new data, we instead turn to local logistic regression to determine what type of pitch an individual is most likely to throw prior to his delivery. Recall the traditional global logistic loss function for binary classification, frequently referred to in literature as cross-entropy [9]. Here, $w$ represents the set of weights (or coefficients) assigned to each of the features provided to the model:

\[ L(w) = -\sum_{i=1}^{n} \left[ y_i \log\left(\frac{1}{1 + e^{-w^{T}x_i}}\right) + (1 - y_i) \log\left(1 - \frac{1}{1 + e^{-w^{T}x_i}}\right) \right] \quad (1) \]

However, this loss function assumes that $w$ is fixed across all observations, while in the local setting, we instead adjust the model's parameters depending upon a test observation's proximity to its neighbors in the training data. Thus, for some observation $x_j$, we can represent the loss of the associated weights $w_j$ in the following manner:

\[ L(w_j) = -\sum_{i=1}^{n} K(x_j, x_i) \left[ y_i \log\left(\frac{1}{1 + e^{-w_j^{T}x_i}}\right) + (1 - y_i) \log\left(1 - \frac{1}{1 + e^{-w_j^{T}x_i}}\right) \right] \]

Unfortunately, the global minimum of the above loss function cannot be expressed via a closed-form solution; however, it can be minimized using gradient descent or some other optimizer, and thus, while computationally challenging, a numerical solution can in fact be found.

In this customized local loss function, the $K(x_j, x_i)$ term represents the computed distance between observations $x_j$ and $x_i$. Given that the pairwise Earth Mover's Distance between all pitchers has already been defined, it can be naturally substituted here. It is worth noting, however, that higher distances between pitchers imply less similarity between the individuals in question; without transforming these distances, such values would actually assign more weight to those pitchers who are least similar to the individual in question. Instead, we would hope to more heavily emphasize those nearby pitchers while assigning little (if any) influence to those that are most distant, and thus, the computed distances must be inverted. An ideal kernel would also be both continuous and differentiable everywhere, such that the eventual optimization of the local parameters could be simplified as much as possible [9]. While a number of kernels have been put forth in previous literature, including the Gaussian, Epanechnikov, and tri-cube weighting schemes, for the purposes of our project, we selected a simple negative exponential, of the form:

\[ K(x_j, x_i) = e^{-\mathrm{EMD}(x_j, x_i)} \]

Here, we acknowledge that the exponential function is both continuous and differentiable, while also going to 0 as $\mathrm{EMD}(x_j, x_i)$ approaches infinity, another property of favorable kernel functions because the weights will die off as distance between observations increases [9]. The exponential is also rather simple, in that it requires the computation of only the Earth Mover's Distance between two observations, such that no calculations over the dataset need be done. Though simplistic, the negative exponential satisfies many of the desirable properties of a kernel-weighting scheme, and thus, such a function seems suitable for this project. After determining how different observations were to be weighted, it was necessary to minimize the computational overhead needed to generate localized predictions, and thus, we turned to unsupervised learning to identify cohorts of similar pitchers.
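In practice, a weighted logistic fit of this kind can be approximated with R's glm(), which scales each observation's contribution to the binomial log-likelihood by a prior weight. The sketch below assumes a hypothetical 0/1 response is_fastball and a hypothetical vector emd_to of precomputed EMDs from the target pitcher to the pitcher responsible for each training row:

```r
# Illustrative locally weighted logistic regression
kernel_weights <- exp(-emd_to)  # negative-exponential kernel, as above

# Each training pitch contributes to the loss in proportion to its kernel
# weight, mirroring L(w_j); expect a benign warning about non-integer
# weights when the response is 0/1
local_fit <- glm(is_fastball ~ balls + strikes + outs + score_diff,
                 family = binomial, data = train,
                 weights = kernel_weights)

# Predicted probability that an upcoming pitch is a fastball
p_hat <- predict(local_fit, newdata = test, type = "response")
```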

4.3 Clustering

After computing the pairwise Earth Mover’s Distance for every observed pitcher in relation to his peers in the data and devising a kernel-weighting scheme to appropriately assign influence to nearby points, the distance metric could then be used in order to cluster pitchers based on the similarity of their arsenals. This was done to identify potentially comparable individuals: namely, perhaps those pitchers whose pitches behave similarly also choose to throw them in the same order. Thus, by grouping pitchers and weighting the observations of those most comparable more heavily in the modeling process, we hope to identify common trends among clusters, informing our predictions based on the behavior of other physically comparable pitchers.

In order to actually perform this unsupervised learning, it was necessary to identify an algorithm that could operate using a distance matrix as its sole input; while the k-means clustering algorithm is rather popular and efficient, it is sadly unsuitable for the problem at hand [17]. Because of the abstraction of the Earth Mover's Distance, we are capable of observing only where an observation (a pitcher, in this instance) lies in relation to its peers, while its location in real space is obscured. Therefore, because the k-means algorithm relies upon an understanding of the problem space (points not corresponding to observations can be chosen as centroids), it cannot be applied when only a pairwise distance matrix is provided.

However, rather than using k-means clustering, we can use an adaptation of the algorithm, commonly referred to as either k-medoids or Partitioning Around Medoids (PAM). Unlike k-means, only observations (rather than any point in the problem space) are considered as centroid candidates, and thus, the algorithm can take either a set of feature vectors or a distance matrix as input before computing clusters. In addition to being more convenient for the problem at hand, it has also been shown that the k-medoids algorithm converges more quickly upon a solution than the k-means procedure; asymptotically, PAM runs in $O(k(n-k)^2)$ time, where $k$ represents the number of clusters and $n$ is the number of observations in the data [17]. The steps of the unsupervised learning algorithm can be summarized as follows:

1. Arbitrarily select k observations as cluster medoids.

2. Assign all observations to the nearest medoid.

3. Define cost as the summed distance of every observation to its nearest medoid. Then, while cost decreases:

   (a) For every medoid m and non-medoid observation o:

       i. Swap m and o, re-assign observations to the nearest medoid, and compute the cost of this new configuration.

       ii. If the cost has increased, undo the swap.

In order to actually compute the clusters observed in our data, we turned to the implementation of PAM found in R's cluster package. Typically, unsupervised learning is used to identify similar observations and group them; here, rather than pursuing this goal, we instead hope to appropriately weight training observations when evaluating unseen data points as part of our models. In other words, these clusters are formed for computational speed-up, and thus, typical methods for choosing the optimal number of clusters k (i.e. the elbow method, silhouette method, or gap statistic, among others) are likely to identify fewer medoids than desired. Thus, rather than using these more principled unsupervised learning techniques to choose the number of clusters in our data, we instead arbitrarily chose to partition the pitchers in our dataset into 50 distinct groups. Then, the distance between groups can be represented by the already-computed distance between their medoids.
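A minimal sketch of this clustering step using the cluster package, assuming emd_mat is the precomputed pairwise Earth Mover's Distance matrix between pitchers (rows and columns labeled by pitcher id):

```r
# Illustrative PAM clustering on the pairwise EMD matrix
library(cluster)

# pam() accepts a dissimilarity matrix directly when diss = TRUE
fit <- pam(emd_mat, k = 50, diss = TRUE)

groups  <- fit$clustering  # cluster assignment for every pitcher
medoids <- fit$medoids     # the 50 representative pitchers

# Distances between groups are represented by distances between medoids
group_dist <- emd_mat[medoids, medoids]
```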

4.4 Replicating Previous Work

Before beginning our own analysis of the pitch prediction problem, it seemed prudent to first attempt to replicate previously published work, considering that the majority of research done in the baseball sphere is neither peer-reviewed nor released for academic purposes. While a number of articles could have been chosen for replication, the most frequently cited paper in the literature is Guttag and Ganeshapillai’s “Predicting the Next Pitch.” Thus, their work was selected as the basis for our replication.

In their seminal work, Guttag and Ganeshapillai use linear support vectors to make binary predictions about whether subsequent pitches thrown by an individual will be fastballs or non-fastballs [8]. Having access to pitch-by-pitch data from 2008 and 2009, the researchers chose to train their models on the first season and evaluate their performances on the second; they fit separate support vector machines for every pitcher in their sample who had thrown at least 300 pitches in both seasons (n = 359). For each observation in their dataset, Guttag and Ganeshapillai report a number of features as inputs to their machine learning methods, including the inning, count, number of outs, score differential, and baserunning state at the time of the pitch, the handedness of the pitcher and batter, as well as the pitcher's pitch count, past pitches thrown between the current pitcher and batter, and past pitches thrown by the pitcher in the same count, in addition to the velocity and location of the three previously thrown pitches, and the result of the pitch thrown immediately prior. While it was not possible to exactly replicate this published work because of the limited data at our disposal, we did try to approximate Guttag and Ganeshapillai's results as closely as possible, using at least those 12 predictors (highlighted in Table 2) deemed to be most significant in their research, among others.

  Game State:       Balls, Strikes, Outs, Score, Inning, Handedness
  Situational:      Shift, Bases Loaded, Pitch Count
  Priors:           Home Team, Batting Team, Count, Batter
  Batter Profile:   Performance, Slugging, Runs
  Previous Pitches: Velocity, Location

Table 1: Covariates reported by Guttag and Ganeshapillai in their research entitled "Predicting the Next Pitch"

Comparing their work to a naive baseline model (one which simply predicted an individual's most commonly thrown pitch type in 2008 for every test observation in 2009), Guttag and Ganeshapillai reported a mean improvement of nearly 18%; in raw terms, their individually trained classifiers achieved 70% accuracy on average, compared to the naive model that correctly predicted the binary pitch type classification of 59.5% of unseen test observations.

Predictor                      Weight
Pitcher-Batter Prior           0.4022
Shrunk Pitcher-Batter Prior    0.2480
Pitcher-Count Prior            0.2389
Shrunk Pitcher-Count Prior     0.2238
Previous Pitch's Velocity      0.1529
Velocity Gradient              0.1359
Previous Pitch Type            0.1138
Inning                         0.0650
Outs                           0.0522
Score Differential             0.0408
Bases Occupied                 0.0398

Table 2: Average normalized coefficients corresponding to the linear support vector classifiers trained by Guttag and Ganeshapillai

Despite our best efforts to reproduce their work (with further details provided in Section 5.2.5), we were unable to demonstrate a similar improvement, as our linear classifiers achieved a mean accuracy of 61.6%, only marginally eclipsing the established naive model's score of 61.4%.

While differing feature sets and unique data could perhaps explain the discrepancies in our findings (Guttag and Ganeshapillai trained their machine learning methods on the 2008 season and tested their performance on 2009, while we used 2016-2017 as training data and held out 2018 for evaluation), it also seems entirely possible that pitchers of the present are truly less predictable than those of the past, given the influx of analytical minds that have flooded both the playing fields and front offices of professional baseball organizations.

Unfortunately, given our inability to document significant improvements when dealing with only binary predictions, this result would seem to portend a similar outcome when generalizing our research to k-class pitch type predictions.

Because of the challenges we encountered while trying to replicate the work of Guttag and Ganeshapillai, we did try to reach out to the authors, in an attempt to verify that our approach aligned as closely with their research as possible. Unfortunately, neither author responded to electronic correspondence, and thus, it is unclear whether our implementation of their approach is somehow flawed, or whether our data-generating process is simply different, given that nearly a decade has passed since the publication of their work and the wealth of baseball research that has been made available to the professional community in the meantime. Every effort was made to ensure that the code written for the purposes of replication functioned properly, and after rigorous testing, we are sadly left with no sufficient explanation for the deviation of our results from those previously reported.

4.5 Models

Previous research detailing the pitch type prediction problem has utilized a number of statistical techniques in order to approximate pitcher behavior, including linear discriminant analysis, random forests, and boosted trees, in addition to the aforementioned support vector machines (with both linear and higher-order kernels). A brief summary of each of these algorithms can be found below.

4.5.1 Linear Discriminant Analysis

Linear discriminant analysis (LDA) was popularized by Sir Ronald Fisher, beginning in 1936 as a derivative of Fisher's linear discriminant [12]. Because of its usefulness in identifying linear combinations of features that characterize individual classes, LDA is frequently used for dimensionality reduction before passing along its outputs to later-developed, more sophisticated machine learning algorithms, though it can be used on its own as a classifier as well. As a form of supervised learning that uses linear combinations of continuous features to predict a categorical class label, linear discriminant analysis is similar in nature to logistic regression, though it makes stronger assumptions: LDA requires the features within each class to be normally distributed, with an identical covariance matrix shared across classes [12]. While it has been shown that discriminant analysis is relatively robust even when these assumptions are violated in some manner, because the prediction function relies upon the inversion of that covariance matrix, LDA's performance can suffer when presented with covariates that exhibit multicollinearity, as the matrix becomes nearly singular [12]. However, when its assumptions are met, linear discriminant analysis has been shown to outperform less parametric machine learning methods.
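For concreteness, a classifier of this kind can be fit in a few lines; the sketch below uses scikit-learn on synthetic stand-in data, with the feature matrix, labels, and train/test split serving only as placeholders for our actual pitch-level features.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 45))        # stand-in for the 45 situational features
y = rng.integers(0, 2, size=1000)      # 1 = fastball, 0 = non-fastball

lda = LinearDiscriminantAnalysis()     # estimates per-class means and a shared covariance
lda.fit(X[:800], y[:800])              # fit on the "training seasons"
print(lda.score(X[800:], y[800:]))     # classification accuracy on held-out pitches
```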

4.5.2 Support Vector Machines

Support Vector Machines (SVMs) were first proposed by Vapnik and Chervonenkis in 1963 as a form of supervised learning for classification, in which observations could be represented as points in space and hyperplanes could be fit to the data in an attempt to separate individual classes [1]. More formally, an observation with p features can be represented as a vector in p-dimensional space; then, if a (p − 1)-dimensional hyperplane can be used to separate observations depending on their class label, the hyperplane in question can be used to define a linear classifier. However, such a hyperplane may not be unique (i.e. there may be multiple ways to separate the classes), and in these instances, the optimal hyperplane is the one that maximizes the distance between its boundary and the nearest member of each class [1]. This distance is known as the margin, and the nearest observations used to define the margin are referred to as "support vectors," yielding the name of the classifier. While support vector machines traditionally support only linear classification, they can in fact be used to approximate non-linear algorithms via the "kernel trick," in which observations are mapped to higher dimensions such that classes are linearly separable in the transformed space. However, after performing the kernel trick, support vector machines lose much of their interpretability, and thus, in previous attempts to address the pitch prediction problem, linear SVMs were more popular, as researchers were also attempting to perform inference and subsequently learn more about pitchers' decision-making processes.
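A linear SVM of the kind favored in prior work can be sketched as follows; the data is again a synthetic stand-in, and the penalty parameter C shown here would in practice be chosen by cross-validation.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 45))
y = rng.integers(0, 2, size=1000)

svm = LinearSVC(C=1.0)                 # C would be cross-validated in practice
svm.fit(X[:800], y[:800])
print(svm.coef_.ravel()[:5])           # linear weights, interpretable in the spirit of Table 2
print(svm.score(X[800:], y[800:]))     # held-out accuracy
```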

4.5.3 Classification Trees

Classification trees represent a form of decision tree learning, a subset of machine learning algorithms in which a predictive model asks a series of yes-no questions about the features of an observation prior to estimating its target value [5]. While also common in regression settings, trees are perhaps more popular in classification problems, in which the predicted class for a test observation is determined by a vote among the training samples that the tree deems similar to it. Classification trees are typically fit using "recursive partitioning," during which internal nodes of the tree filled with training observations are greedily split along some feature at a decision boundary, such that the original node is distilled into two smaller child nodes. This process continues in each sub-node (hence the "recursive" term), until either the resulting child is pure (i.e. filled with observations belonging to only one class) or some stopping criterion is reached. While decision trees are lauded for their ease of implementation and interpretable nature, they are also quite prone to overfitting, and thus, rather than being used as a final model, classification and regression trees are instead more frequently utilized as base learners for the more advanced machine learning methods whose descriptions follow [5].
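The yes-no structure produced by recursive partitioning is easy to inspect on a toy example; in the sketch below, the feature names (balls, strikes, outs) are illustrative placeholders rather than our actual feature set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.integers(0, 2, size=1000)

# max_depth and min_samples_leaf act as the stopping criteria mentioned above.
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=20)
tree.fit(X, y)
print(export_text(tree, feature_names=["balls", "strikes", "outs"]))  # the yes-no questions
```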

4.5.4 Random Forests

As previously stated, classification trees are often thought of as the base models for more sophisticated algorithms, the most popular of which is likely the random forest. As implied by its name, a random forest is a collection of such decision trees, synthesized via a variance-reduction technique known as bootstrap aggregation, or "bagging."

Bagging was introduced by Breiman in 1994 as a form of model averaging, designed as an ensemble algorithm that could be used to stabilize the predictions of base learners, or in other words, reduce overfitting [5]. To achieve this goal, bagging leverages the bootstrap procedure: given a sample D of n observations, one can sample from D uniformly with replacement n times and repeat this process for B rounds, thereby producing B bootstrapped datasets D(1), D(2), ..., D(B), designed to approximate the true population distribution.

Rather than fitting a single classification tree to the original training data, the random forest instead utilizes bagging, fitting B decision trees, one for each bootstrapped sample. In addition, rather than considering every feature before splitting an internal node of the tree during its construction via the recursive partitioning algorithm, the random forest instead employs "feature bagging," considering only a subset of covariates for each split within the tree [5]. By doing so, the forest of trees is de-correlated, as the most significant predictors may not always be selected for splitting, and thus, the ensembling procedure minimizes the variance of the classifier while taking on only marginal bias, often yielding a better-performing machine learning method than a single classification tree.
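A hand-rolled sketch of these two ingredients, bootstrap resampling and per-split feature subsampling, is shown below on synthetic stand-in data; in practice one would simply call a library's random forest implementation, as we do in Section 5.2.6.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 45))
y = rng.integers(0, 2, size=1000)
n, B = 800, 100

trees = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)                 # one bootstrapped dataset D(b)
    t = DecisionTreeClassifier(max_features="sqrt")  # feature bagging: subset of covariates per split
    trees.append(t.fit(X[idx], y[idx]))

# Majority vote across the B de-correlated trees.
votes = np.mean([t.predict(X[n:]) for t in trees], axis=0)
pred = (votes > 0.5).astype(int)
print((pred == y[n:]).mean())
```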

4.5.5 Boosted Trees

Though often mentioned in the same breath as bagging, "boosting" represents an entirely different procedure, one in which base learners are sequentially fit to identical training samples, such that those observations mis-classified by previous iterations are more heavily emphasized in future realizations of the predictions [7]. Like bagging, boosting procedures can be applied to any simpler models, but because of the ease with which they can be implemented and their ability to detect non-linear relationships between features and response variables, decision trees are most commonly employed as the base learners for boosting algorithms.

Unlike bagging, which was proposed as a method of reducing variance via ensembling, boosting was first reported by Schapire in 1990 while searching for an algorithm that could reduce the bias of "weak" learners (i.e. classifiers that do only slightly better than random guessing when assigning labels to test observations), somehow aggregating them to form a "strong" learner (i.e. a well-calibrated model) [7]. Boosting accomplishes this goal by fitting a series of decision trees, in which those observations that are ultimately mis-labeled receive heavier weights in future iterations of the model. Eventually, these weak models are ensembled (often via a linear combination of their predictions) to form a final classifier, one which significantly outperforms any of the individual base models [7]. Though boosting has been shown to exceed the performance of random forests in some instances, because of the additional computational complexity of the algorithm (bagged trees can be fit simultaneously while boosted trees must be trained sequentially because of the re-weighting step) and the time required to tune the hyperparameters used in boosting, random forests are more commonly used in practice despite the possibility of superior performance.

However, previous research in the pitch prediction sphere has not touched upon the usefulness of boosting, and thus, it was deemed a suitable extension of published work worth exploring as part of this thesis.
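The re-weighting scheme just described is essentially the AdaBoost algorithm of Freund and Schapire [7]; a minimal sketch on synthetic stand-in data follows, though our actual experiments in Section 5.2.7 used gradient boosted trees rather than AdaBoost.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 45))
y = rng.integers(0, 2, size=1000)

# Mis-labeled observations are up-weighted each round, and the final classifier
# is a weighted combination of the sequentially fit weak learners (the default
# base learner here is a depth-1 decision stump).
boost = AdaBoostClassifier(n_estimators=200)
boost.fit(X[:800], y[:800])
print(boost.score(X[800:], y[800:]))
```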

5 Results

5.1 Multi-Class Classification

In order to evaluate the performance of our machine learning methods, it was necessary to first establish a competitive baseline that could be used for comparison, in order to determine whether our models actually exhibited improvement compared to some naive prediction method. While a number of baselines could be selected for this purpose, we chose to employ a strategy similar to those put forth in previous research: after reviewing every pitch thrown by an individual in the training sample, his most frequently thrown pitch type in that subset of the data is predicted for every observation in the unseen testing set. For instance, if Pitcher A were to throw his fastball 40% of the time in the training data, his slider 20% of the time, his curveball another 20%, and his changeup for the remaining 20%, then our naive model would predict a fastball for every pitch thrown by Pitcher A in the testing data, yielding a 40% accuracy in expectation, provided that the pitcher in question did not drastically alter his pitch type distribution.
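The baseline itself amounts to a small bit of bookkeeping. The sketch below expresses it with pandas; the toy DataFrame and its column names are stand-ins for our actual pitch-level data, not the schema of the proprietary dataset.

```python
import pandas as pd

# Toy stand-in for the pitch-level dataset; the column names are assumptions.
pitches = pd.DataFrame({
    "pitcher":    ["A", "A", "A", "A", "A", "A"],
    "season":     [2016, 2016, 2017, 2017, 2018, 2018],
    "pitch_type": ["FF", "SL", "FF", "CU", "FF", "SL"],
})

train = pitches[pitches.season < 2018]
test = pitches[pitches.season == 2018]

# Each pitcher's most common training pitch type is predicted for all of his test pitches.
favorite = train.groupby("pitcher")["pitch_type"].agg(lambda s: s.mode().iloc[0])
pred = test["pitcher"].map(favorite)
print((pred == test["pitch_type"]).mean())  # naive baseline accuracy
```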

To create such a baseline, we first had to define our training and testing sets, keeping in mind the temporal structure of our data. In essence, future pitches should not be used to predict past ones, as our machine learning methods would not have access to such data in real-time, and thus, such an approach would likely inflate our test accuracy compared to how our performance would generalize to the 2019 Major League Baseball season and beyond. Thus, to maintain the integrity of our experiment, we chose to use the 2016 and 2017 seasons as training data to ultimately make predictions about the 2018 season which was held aside for use in testing. To reduce some of the noise in modeling

(i.e. pitchers with very few observations likely do not exhibit enough signal to predict their future sequencing decisions with accuracy), we limited the scope

of our research to only those individuals who had thrown at least 600 pitches in the training set and appeared at least 300 times in the test data, yielding a sample size of n = 425 unique pitchers, ranging from a minimum of 609 training observations to a maximum of 7,324 pitches.

Pitcher           Pitch Type            Test Accuracy (%)
James Pazos       Sinker                100.0
Sean Doolittle    Four-Seam Fastball    98.7
Steven Wright     Knuckleball           96.2
Chad Green        Four-Seam Fastball    95.7
Jake McGee        Four-Seam Fastball    95.7

Table 3: Pitchers with the highest naive test accuracies

While reasonable, this naive model is quite pitcher-dependent, and thus, for those individuals who predominantly rely upon a single pitch type, the baseline will actually be quite high, making it difficult to surpass. As seen in Table 3, the list of pitchers with the highest naive test scores is dotted with relievers who often pitch only an inning or so at a time and thus require a less-developed repertoire than their starting pitcher peers, or specialists like Steven Wright, who represents something of an outlier as one of the only active knuckleballers at the highest level of professional baseball. Because these pitchers so frequently turn to their most dominant pitch, we see that our baseline model is able to achieve impressive test accuracies by simply always predicting that pitch, and thus, because of this outstanding performance, it seems unlikely that any machine learning method will be able to achieve substantial improvement in these instances due to the class imbalance present in the training data.

On the other hand, as demonstrated in Table 4, there are also pitchers in the dataset for whom the naive model performs quite poorly. While somewhat unexpected, these low baseline scores can perhaps be explained by a variety of factors. First, it is worthwhile to note that the pitch type labels we hope to predict do not necessarily represent the "ground truth," but instead are

a product of Major League Baseball Advanced Media's (MLBAM) attempt to approximate such truth via an underlying neural network. Physical attributes of every pitch thrown during a game, including release velocity, horizontal movement, and vertical movement (among many other measurements), are recorded by MLBAM's Statcast system and fed as input to a classification algorithm, one which buckets similar pitches based on their movement profiles and assigns pitch type labels to them. In order to actually know what type of pitch an individual had just thrown, we would have to ask him, and unfortunately, we do not have the luxury of this access; thus, we are forced to rely upon the outputs of

MLBAM's proprietary system. This could potentially explain the naive model's poor performance in some instances: if a pitch thrown by an individual was assigned one label by the classification algorithm in 2017 and found itself close to a decision boundary, then because of an update to the system, a small change in the pitcher's grip, or perhaps random chance, the same pitch thrown in 2018 could shift across the boundary, being labeled as an entirely different type and thereby confusing our predictive model through no fault of its own.

It is worth noting that four-seam and two-seam fastballs often exhibit similar characteristics, and that the line dividing sliders and curveballs is notoriously blurry; all of these labels appear in Table 4 as relatively unsuccessful naive predictions, implying that such ambiguity may be to blame. While slightly less plausible, one also must acknowledge that our data-generating process is driven by humans, professional baseball players that evolve from pitch to pitch, game to game, and season to season. It would be irresponsible to suggest that a pitcher's approach is stagnant and never-changing, and thus, it is possible that individuals could overhaul their repertoires in the off-season, learning new pitch types that may lead to more success than their current profiles. In these instances, our models would likely once again be fooled, as such drastic changes

in pitch sequencing would be nearly impossible to predict.

Pitcher              Pitch Type            Test Accuracy (%)
Kevin Jepsen         Two-Seam Fastball     33.3
Jerry Blevins        Sinker                33.9
Chaz Roe             Curveball             35.9
Zach Britton         Two-Seam Fastball     37.5
Madison Bumgarner    Four-Seam Fastball    38.7

Table 4: Pitchers with the lowest naive test accuracies

5.2 Binary Classification

After more sophisticated algorithms failed to significantly improve upon the previously proposed baseline, it seemed that perhaps our initial goal of multi-class classification was too aggressive, and in turn, we decided to focus our efforts upon binary classification (i.e. determining whether a given pitch would be a fastball or a non-fastball). Though a less interesting result, restricting the number of response labels inherently simplifies the prediction problem, such that traditional logistic regression and other prediction methods can be trained on data from the 2016 and 2017 seasons before being evaluated on held-out 2018 test data.

5.2.1 Unregularized Logistic Regression

To predict the probability of a fastball being thrown, we began our analysis with a simple logistic regression model, one which included the entire feature set (45 predictors outlined above). Though not incredibly powerful, the logistic regression framework is rather interpretable, and thus, such a model could perhaps inform future design decisions. After fitting our method to the training data, our logistic regression model posted a mediocre performance on the testing set, registering an accuracy of 61.3% and a log-loss of 0.691 (compared to the naive model's 61.4% test score). Grouping by individual pitchers (as seen in Tables 5 and 6), we observe a few trends: namely, the list of high scores is almost entirely comprised of relievers that rely upon a single pitch type. Thus, rather than indicating that our model performs particularly well, the high accuracies instead reflect the existence of some pitchers who thrive despite limited deception in their arsenals. As seen in Figure 6, when aggregated by pitcher (marker size corresponds to relative sample size), the test accuracies of the unregularized logistic regression model largely agreed with the individual's baseline score, and while the model was able to outperform the naive guesses at lower levels, these methods produced nearly identical results on average.

Figure 6: Scatter plot comparing the test performance of unregularized logistic regression to naive baseline predictions

While our model was successful in a number of instances, it is worth noting that of the 45 predictors used in the original model, only 7 were found to have coefficients with an absolute value greater than 0.5 after optimizing for log-loss, perhaps indicating that a number of our features were uninformative.

Pitcher          Test Accuracy (%)
Jacob Barnes     100.0
Alex Colome      98.4
Ryan Tepera      95.7
Nick Vincent     95.5
Kenley Jansen    94.4

Table 5: Pitchers with the highest test accuracies using unregularized logistic regression with the full feature set

Pitcher            Test Accuracy (%)
A.J. Cole          38.3
Jonathan Holder    41.6
Paul Blackburn     46.5
Chad Bettis        47.1
Dan Altavilla      47.1

Table 6: Pitchers with the lowest test accuracies using unregularized logistic regression with the full feature set

While a high number of covariates were deemed to be statistically significant (likely because of the size of our dataset rather than an actual relationship to the response), it seems quite possible that a high proportion of the features used for the prediction problem were actually uncorrelated with the output, indicating that our original model may be unnecessarily parametrized. In order to assist with feature selection (hopefully eliminating useless covariates), we first decided to turn to L1 regularization, frequently referred to as the LASSO method.

5.2.2 L1-Regularized Logistic Regression

As mentioned above, research has noted that the L1 penalty is a particularly effective tool for identifying an optimal feature set when developing models, as it naturally shrinks unnecessary coefficients to 0 while minimizing its objective function [16]. The LASSO penalty for logistic regression can be represented via the following expression:

L1(w) = L(w) + λ Σ_{j=1}^{p} |βj|

Here, L(w) refers to Equation 1, the traditional log-loss function, and the remainder of the expression corresponds to the penalty term, which can be regulated by the hyperparameter λ. Larger values of λ more aggressively penalize coefficients with greater absolute values while simultaneously driving the parameters associated with uninformative predictors to 0, effectively eliminating them from the model [16]. Given the number of covariates in our original logistic regression model that seemed only lightly correlated with the fastball response variable, such regularization seemed reasonable, and thus, after training a new model on 2016 and 2017 data and cross-validating to find the optimal λ = 0.1, we evaluated this new model on the same held-out testing data. Unfortunately, the LASSO penalty had little to no effect on our performance, as the new logistic regression model produced a nearly identical log-loss score while seeing an increase in classification accuracy of less than a percentage point, to only 61.4%

(the same score achieved by the naive baseline). While our model could simply be failing to identify higher-order effects of the features or interactions between covariates, given the limited impact regularization had on our performance, it seems reasonable to question whether the data available to us actually contains enough signal to generate accurate predictions, or if instead we are simply attempting to model an inherently random process, in which case our models are rendered rather useless.
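For reference, the cross-validated procedure just described can be sketched as follows on synthetic stand-in data; note that scikit-learn parameterizes the penalty strength as C = 1/λ, so its reported values must be inverted to compare with the λ = 0.1 above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 45))
y = rng.integers(0, 2, size=1000)

# L1-penalized logistic regression with the penalty chosen by 5-fold cross-validation.
lasso = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10, cv=5)
lasso.fit(X[:800], y[:800])
print((lasso.coef_ != 0).sum(), "coefficients survive the penalty")
print(lasso.score(X[800:], y[800:]))
```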

5.2.3 Varying Coefficients Logistic Regression

As outlined above, in addition to running a global logistic regression model, we hypothesized that a local model (also referred to in the literature as a "varying coefficients" model) may be useful as well, given that we had already computed the pairwise Earth Mover's Distance between all pitchers and subsequently used the unsupervised k-medoids algorithm to assign them to separate classes depending on their positioning in the space, relative to their peers [9].

While we were able to successfully train such a model, the localized nature of our predictions ultimately had little effect upon our performance.

Figure 7: Scatter plot representing the correlation between predictions made by the global and local regression models

As seen in Figure 7, the predictions made by our varying coefficients model did resemble those of the global logistic model, though there were some interesting discrepancies, as evidenced by the dispersion of the points. Unfortunately, despite the differing individual predictions, in the aggregate, our local model produced nearly identical results when compared to those of the global one, producing the same classification accuracy (61.3%) with a similar log-loss (0.692 for the localized regression, 0.691 for the global model), again only matching the performance of the naive model (61.4% test accuracy). This would seem to indicate that, while novel, our hypothesis was unfounded: pitches thrown by individuals whose repertoires most closely resemble those of the pitcher in question provide no more information than those thrown by pitchers whose arsenals look nothing alike.
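The following is a rough sketch of the kernel-weighting idea on synthetic stand-ins: every training pitch is weighted by a kernel applied to the distance between its pitcher's cluster medoid and the medoid of the cluster being predicted for. The exponential kernel and its bandwidth are illustrative assumptions, not the exact specification used in our experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p, k = 1000, 45, 50
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)
cluster = rng.integers(0, k, size=n)          # cluster of each pitch's pitcher

# Synthetic stand-in for the EMD distances between cluster medoids.
D_medoid = rng.uniform(1, 10, size=(k, k))
D_medoid = (D_medoid + D_medoid.T) / 2
np.fill_diagonal(D_medoid, 0.0)

def local_model(target_cluster, bandwidth=2.0):
    # Kernel-weight every training pitch by its pitcher's medoid distance
    # to the target cluster; nearby clusters contribute more to the fit.
    w = np.exp(-D_medoid[cluster, target_cluster] / bandwidth)
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)

model_for_cluster_0 = local_model(0)
```

One localized regression is fit per cluster, so the precomputed 50-by-50 medoid distance matrix replaces the far more expensive search through all training pitchers that a pure nearest-neighbor method would require.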

5.2.4 Linear Discriminant Analysis

Just as in the multi-class prediction case, after fitting some simple interpretable models, as well as defining a naive baseline, it seemed reasonable to continue our research by proposing some more sophisticated machine learning methods, ones that could perhaps better approximate the choices made by individual pitchers when preparing to attack opposing hitters. One of the first such models we chose to implement was Linear Discriminant Analysis (LDA) (previously outlined in Section 4.5.1), in which linear combinations of features are used to separate classes under strict assumptions. Unfortunately, LDA did not prove particularly helpful under this project's circumstances, as the models produced a mean accuracy of only 59.3%; in contrast, the naive baseline (in which every test observation is predicted to be the majority class found in the training data) produced a mean score of 61.4%, suggesting that the LDA models actually performed worse than a simple intuitive guess on average. A common solution in such instances is to acquire more training data, though Figure 8 would suggest that linear discriminant analysis performed quite poorly for even those individuals whom we observed most frequently at training time. Once more, such a result provides compelling evidence seeming to indicate that the pitch selection process is inherently quite noisy.

Figure 8: Scatter plot relating an individual pitcher's test accuracy using LDA to the number of observations thrown by him in the training data

5.2.5 Support Vector Machines

As described in Section 4.5.2, though not as interpretable as logistic regression, the support vector machine (SVM) was once considered a leader in the machine learning field, and as such, has been used for a variety of applications, including pitch type prediction, as demonstrated by Guttag and Ganeshapillai's previous research [8]. Thus, their inclusion in our own work was largely for the sake of replication; while numerous other methods have failed to capture meaningful signal in the available data, given that previous work has found SVMs to be suitable approximations, we would expect the support vector machine to be useful in this iteration of research as well.

Unfortunately, despite our best attempts to re-create the work of Guttag and Ganeshapillai as closely as possible, as outlined in Section 4.4, when using support vectors with linear kernels as binary classifiers, we were unable to recover their reported scores. The individually trained SVMs with cross-validated penalty parameters achieved only 61.6% test accuracy on our dataset (compared to the naive baseline of 61.4%), suggesting that the models did essentially no better than the simplest guess imaginable. Such a finding would seem to suggest that the pitch prediction problem may simply be driven by randomness, though because the SVM has been surpassed by other more sophisticated methods in the world of machine learning, it seemed prudent to continue our research using newer techniques as well.

5.2.6 Random Forests

Unlike the approaches described above, random forests leverage the power of ensembling, or more specifically, "bagging," as outlined in Section 4.5.4. Because of the ease with which random forests can be implemented, and their ability to identify non-linear relationships between features and responses, as well as complex interactions between covariates, they are often recognized as the workhorse of modern machine learning, and thus, the performance of the random forest often exceeds that of other less sophisticated predictive models.

While our research indicates that this is true in this situation as well, the effect of ensembling bootstrapped classification trees with randomly subsampled feature splits is less pronounced than anticipated: even after training individual forests and cross-validating key hyperparameters (number of estimators, maximum tree depth, and minimum number of samples required in leaf nodes) across the entire population in question, we find that our collection of random forests achieves a mean accuracy of only 62.7%, a rather unimpressive raw score that is only further weakened by the backdrop of the naive model's 61.4% accuracy. As observed in Figure 9, while the random forests do exhibit a small improvement over our baseline predictions (the mean increase in test accuracy is 1.33 percentage points), the skew of the distribution suggests that this boost is perhaps aided by some outliers, which is confirmed by examination of the median; when viewed through this lens, the random forests actually provide no improvement over our naive guesses. Given the random forest's inability to extract any signal from our data, evidence rejecting the feasibility of even binary pitch type prediction continues to mount.

Figure 9: Histogram representing the improvement of individually trained and cross-validated random forest models over each pitcher's naive baseline
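The cross-validation described above can be expressed as a small grid search; the sketch below uses scikit-learn with synthetic stand-in data, and the grid values shown are illustrative rather than the ones we actually searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 45))
y = rng.integers(0, 2, size=1000)

# The three hyperparameters named above; the candidate values are placeholders.
grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 10, 50],
}
search = GridSearchCV(RandomForestClassifier(), grid, cv=5)
search.fit(X[:800], y[:800])
print(search.best_params_, search.score(X[800:], y[800:]))
```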

5.2.7 Boosted Trees

Before uniformly concluding that all out-of-box machine learning algorithms fail to capture any signal in the training data, it seemed prudent to examine the performance of gradient boosted trees as well, whose motivation is described in Section 4.5.5. Though seemingly never before applied to the pitch type prediction problem (our review of previous research on the topic uncovered no references to such models), there is growing conviction within the machine learning community that boosting may in fact be superior to bagging, though because of the additional computational resources required to accurately tune its hyperparameters (boosting operations cannot be parallelized), bagging-based ensembles are often preferred in practice [7]. Despite this, given the lack of exploration in this space, it seemed reasonable to apply gradient boosted trees as part of our project. Boosting did produce accuracy on par with the results of the random forest models (62.5% test score), though once again, this represented only a slight improvement over the baseline model (61.4% test accuracy), and thus, we are unable to conclude with any confidence that such an increase in accuracy can be attributed to signal detected by the models rather than noise in the testing data.

Model                           Average Test Accuracy (%)
Linear Discriminant Analysis    59.3
Linear Support Vectors          61.6
Random Forest                   62.7
Gradient Boosted Trees          62.5
Naive Baseline                  61.4

Table 7: Performances of different models when predicting fastballs vs. non-fastballs

As seen in Table 7, each of the machine learning methods considered produced rather similar results, just narrowly eclipsing the baseline accuracy, if at all. Such a finding suggests that pitch type selection may in fact be too random to accurately model, while also reinforcing the idea that new and more sophisticated machine learning methods are only as good as the data provided to them. Given that we were unable to significantly surpass the naive baselines set forth at the start of this research, even after the engineering of new features in addition to the application of novel techniques, we must confront the possibility that the pitch type prediction problem is simply unsolvable given the data available to us at this time, calling into question the findings previously put forth in this sphere.

5.3 Other Forms of Binary Classification

In addition to performing binary classification to predict whether an upcoming pitch was more likely to be a fastball or non-fastball, given our struggles to identify signal within the data, it also seemed reasonable to consider other forms of binary classification, like breaking balls vs. non-breaking pitches, or offspeed pitches vs. non-offspeed ones. These pitch groups were mutually exclusive: fastballs were composed of four-seam fastballs, two-seam fastballs, sinkers, and cutters; breaking balls included curves, sliders, and slow curves; and all other pitch types (changeups, knuckleballs, and splitters) comprised the final offspeed grouping. Because we were unable to separate fastballs from non-fastballs in our first attempts at binary classification, it seemed unlikely that we would be able to identify offspeed pitches or breaking balls using the same feature set, but to ensure that we had not made any key oversights, the analysis was repeated for the other sub-categories.

                                Breaking Pitch           Offspeed Pitch
Model                           Test Accuracy (%)        Test Accuracy (%)
Linear Discriminant Analysis    68.1                     83.0
Linear Support Vectors          70.4                     85.0
Random Forest                   72.0                     86.3
Gradient Boosted Trees          71.8                     86.1
Naive Baseline                  71.6                     86.3

Table 8: Performances of different models when predicting breaking balls and offspeed pitches

Unfortunately, though unsurprisingly, as seen in Table 8, similar classification models trained to identify breaking balls and offspeed pitches once again failed to significantly outperform the naive baseline models, only just surpassing them, if at all. Given that our machine learning methods were unable to accurately predict the labels of test observations with any semblance of consistency, regardless of label, this would again seem to suggest that pitch type selection is unpredictable, in stark contrast with previous research. However, given the wealth of evidence in our own work supporting such a hypothesis, it seems as if such a claim may stand on strong footing in this new analytical age of baseball.

Figure 10: Scatter plot displaying the relationship between pitcher predictability and Earned Run Average

5.4 Relating Predictability to Performance

In addition to simply studying the predictability of individuals, our group was also interested in exploring the possibility of a relationship between one's predictability and his performance. A positive correlation would seem somewhat intuitive; namely, one might expect those pitchers deemed most predictable to perform more poorly than their peers who better mask their decisions, simply because batters may be capable of anticipating what pitch will be thrown. In a game that encodes nearly every signal to prevent opponents from "stealing signs," it would seem that the ability to anticipate what a pitcher is likely to throw might represent an enormous competitive advantage. However, throughout the course of our research, we have noted that those pitchers for whom our models registered the highest test accuracies were high-leverage relievers relying upon a single dominant pitch type, such that hitters still struggled to find success despite knowing what the pitcher was likely to throw.

Figure 11: Scatter plot displaying the relationship between pitcher predictability and Fielding Independent Pitching

Figure 12: Scatter plot displaying the relationship between pitcher predictability and Expected Fielding Independent Pitching

Though our models performed rather poorly in the aggregate, we did observe varying degrees of success at the individual level, where we were able to achieve 100% test accuracy for one pitcher (Jacob Barnes), while on the other end of the spectrum, our binary fastball classifier predicted only 38.3% of A.J. Cole's test observations correctly. Despite this range of test accuracies, when observing scatter plots comparing individuals' performances as measured by common baseball statistics (ERA, FIP, and xFIP) to one index of predictability (found in Figures 10, 11, and 12), we find no strong patterns in the data, though the overall trend of the scatter plots actually seems to suggest a negative correlation between the metrics: pitchers who are more predictable seem to post better ERAs/FIPs/xFIPs (where lower run estimators represent superior pitching performances). This would perhaps suggest that those pitchers who throw only a single pitch type are in fact so dominant with that one pitch that they can out-perform their more deceptive peers with full arsenals. However, given the weakness of this apparent correlation, the evidence would seem to suggest that a pitcher's execution of his plan may be much more important than the choice of pitch type itself.

6 Conclusion

6.1 Overview of the Research

Inspired by past research considering the pitch type prediction problem, this project was designed to expand the scope of the field by further investigating the possibility of multi-class predictions using more advanced statistical machine learning techniques. However, the nature of the conversation changed as attempts to re-create previously published work continued to fail: rather than creating a better predictive model, our focus instead turned towards simply making any model, one which could outperform the naive baselines set forth by other authors. Given the difficulties we encountered while attempting to replicate research and the apparent limited predictive power of our expansive feature set, we are forced to question whether pitch type selection can actually be predicted, as others have suggested, or if it more closely resembles a truly random process.

One of the main ideas put forth by this work was the possibility that a pitcher's movement profile may relate to his pitch sequencing as well. While somewhat of an unfounded hypothesis, such a suggestion does not seem that outlandish at first glance: pitchers are known for their cerebral natures, and one line of reasoning would suggest that individuals may develop plans for attacking hitters by studying how their peers have done so in past meetings.

However, pitchers are unique: they can throw with different hands, possess contrasting arsenals, and have varying levels of confidence in their abilities.

Thus, rather than weighting all pitchers equally when studying their peers' tendencies, we hypothesized that individuals may instead focus their efforts on only those who most closely resemble themselves. Unsurprisingly, similar theories have been proposed in the field of statistical machine learning; the use of kernel methods to identify nearby neighbors is quite common, and thus, our problem seemed like a natural application for such techniques.

In order to use a kernel method to appropriately weight observations, it was first necessary to identify some measure of similarity among pitchers, such that a trained model could recognize analogous individuals. While a number of features could have been used to generate such a metric, we developed a rather simplistic one: rather than defining a pitcher by his performance, or by somewhat subjective pitch type labels, we can instead dive deeper, using the physical attributes of his pitches provided by Statcast as a proxy for his identity.

More specifically, every pitch thrown by an individual can be represented in three-dimensional space, along the axes of release velocity (measured in miles per hour), horizontal break, and vertical movement (both measured in inches), and thus, as we track more pitches and these points accumulate, all of a pitcher’s observations begin to resemble a cloud. At this juncture, the similarity between two clouds can be measured via the Earth Mover’s Distance algorithm, which was developed in part to compare distributions in probability space.
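To illustrate the construction, the snippet below builds two synthetic pitch clouds and computes the distance between them with the POT (Python Optimal Transport) library; POT is merely one convenient implementation of the Earth Mover's Distance and is not necessarily the one used in this research.

```python
import numpy as np
import ot  # Python Optimal Transport, a stand-in for the thesis's EMD implementation

rng = np.random.default_rng(0)
# Two pitchers as point clouds: (release velocity mph, horizontal break in, vertical break in).
cloud_a = rng.normal([93.0, 4.0, 8.0], 1.5, size=(200, 3))
cloud_b = rng.normal([89.0, -2.0, 3.0], 1.5, size=(150, 3))

M = ot.dist(cloud_a, cloud_b)                    # pairwise ground-cost matrix between pitches
a = np.full(len(cloud_a), 1 / len(cloud_a))      # uniform mass on each observed pitch
b = np.full(len(cloud_b), 1 / len(cloud_b))
print(ot.emd2(a, b, M))                          # minimal "earth-moving" cost between the clouds
```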

After computing the Earth Mover's Distance between every pitcher and each of his peers in the training data, the entire population can be represented using a pairwise distance matrix D, in which entry D_{i,j} = D_{j,i} represents the distance between pitcher i and pitcher j as calculated via the algorithm. Now, with a measure of similarity across all pitchers, it is possible to weight individuals more or less heavily depending upon how favorably they compare to the pitcher in question. However, the family of nearest-neighbor models can be quite computationally intensive, as each test observation requires a search through the entire training set, such that those most similar points can be identified. Rather than undertaking such a task, it was determined that unsupervised learning could be used to cluster pitchers based on this pairwise distance matrix, limiting the amount of computational overhead needed to complete the project while only exhibiting minimal effect on the performance of our classification algorithms.

While the k-means clustering algorithm is perhaps the most popular form of unsupervised learning because of its intuitive nature and ease of implementation, it is unfortunately unsuitable for our problem, as the k-means approach requires input observations that can be represented as features in space, while our pairwise distance matrix only encodes the relative location of each pitcher among the population. Thus, rather than relying upon the k-means algorithm, we instead implemented the k-medoids approach, in which cluster centroids are limited to only observed training points, rather than being allowed to exist anywhere in the feature space. With n = 425 pitchers in our sample, we chose k = 50 for our clustering, such that each cohort would contain about 8 individuals on average, significantly diminishing the amount of computational power needed to provide localized kernel predictions depending upon cluster membership.

6.2 Implications of the Results

After identifying previous work done in the field suggesting that pitch sequencing could be modeled using statistical machine learning, replicating these results prior to shifting our focus to improvements upon their performance appeared to be a reasonable plan of action at the outset of this thesis. However, after initial attempts struggled to surpass the scores of the naive baseline models in the k-classification setting, let alone challenge those posted by previous researchers, it became necessary to investigate what was causing such a discrepancy. There could have been flaws in our implementations of supervised learning models, we could have lacked the necessary features to capture the signal of the data, previous authors may have overfit to their training data and erroneously reported high scores that went undetected because of a lack of peer-review, or perhaps most interestingly, the data-generating process concerning the pitch types thrown by Major League Baseball pitchers could have undergone a massive transformation in the years separating the publication of earlier work and our attempts at replication nearly a decade later, using more recent data. Despite our best efforts to duplicate the findings of those who had previously produced k-class labels for upcoming pitch types (including thorough testing of code and e-mails sent to the original authors), our trained and cross-validated models unfortunately yielded little in the way of success, perhaps indicating that our initial pursuits may have been too aggressive. Then, in hopes of simplifying the problem, rather than producing k-class predictions, the scope of the project was instead limited to binary classification, where upcoming pitch types would instead be labeled as either fastballs or non-fastballs.

Count    Test Sample Size    Test Accuracy (%)
0-0      147,003             64.8
0-1      74,156              44.8
0-2      37,814              50.0
1-0      57,126              63.1
1-1      59,395              54.6
1-2      55,844              51.3
2-0      19,543              75.3
2-1      31,010              63.1
2-2      48,398              52.9
3-0      5,947               95.7
3-1      12,786              81.1
3-2      29,045              64.8
All      578,067             58.7

Table 9: Test accuracies of binary predictions made when considering only the ball-strike count

Even after this simplifying assumption, our models unfortunately continued to underwhelm, once again only narrowly surpassing the naive baseline, if at all. Given the high number of pitch types in the k-classification setting, failure to achieve impressive predictive accuracy is not particularly surprising, as professional hitters, coaching staffs, and front offices have informally wrestled with this question for decades, seemingly producing little in the way of meaningful results. However, when considering only fastballs vs. non-fastballs, conventional baseball wisdom would suggest that models could identify meaningful relationships within the data. When batters get ahead in the count (i.e. when the number of balls exceeds the number of strikes, signaling a favorable situation for the hitter), they are often taught to "sit on a fastball," stemming from the belief that a pitcher is likely to throw the pitch most likely to earn him a strike when he falls behind. However, our research here indicates that such an adage is not necessarily true, as even our best models were unable to out-perform naive guessing. In fact, as seen in Table 9, if we were to generate predictions based strictly on the ball-strike count at the time of the pitch (predict all fastballs when the batter is ahead or the count is even, predict all non-fastballs when the pitcher is ahead), we would correctly identify only 58.7% of test observations. In light of the performance of our naive model (61.4% test accuracy), in which we simply predicted an individual's most commonly thrown pitch type for each of his observations, it seems that "sitting on a fastball" based on the count is no more useful to batters than simply "sitting on" the pitcher's most common pitch. After failing to accurately identify fastballs vs. non-fastballs, we continued our research by exploring other binary predictions, including breaking vs. non-breaking balls and offspeed vs. non-offspeed pitches, yet in every iteration, the same result held. Under no circumstances could our out-of-box methods achieve the scores reported by previous researchers, and even after leveraging the power of kernel methods via the use of the Earth Mover's Distance and clustering algorithms, our findings continued to point to the same conclusion: the pitch selection process may be too complex for our models.
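For completeness, the count-based rule evaluated in Table 9 is simple enough to transcribe directly; the function below is a literal rendering of the decision rule described above.

```python
def count_rule(balls: int, strikes: int) -> str:
    # Predict "fastball" when the batter is ahead or the count is even,
    # "non-fastball" when the pitcher is ahead (strikes exceed balls).
    return "fastball" if balls >= strikes else "non-fastball"

print(count_rule(3, 0))  # batter ahead  -> fastball
print(count_rule(0, 2))  # pitcher ahead -> non-fastball
```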

6.3 Final Thoughts

While in one sense, our research is disappointing, as we were unable to build upon previously published results, when viewed in another light, our work stands alone. Rather than adding to the cacophony of voices suggesting that statistics and machine learning can be used to analyze every aspect of sports, our findings push back on that notion, as the results indicate that naively guessing an individual's most commonly thrown pitch type is often more accurate than the output of more sophisticated learning algorithms. Starting with the 2003 publication of Michael Lewis' Moneyball, in which front office analysts leveraged their understanding of on-base percentage and its relationship to wins in order to construct a team greater than the sum of its parts, fans and executives alike have been constantly searching for the next great competitive advantage in baseball, hoping to discover a new revolutionary tactic [11]. While early returns on the pitch prediction problem suggested that this sphere remained ripe for exploration, our own findings dampen this optimism, instead suggesting that researchers' efforts may be better spent elsewhere.

Rather than definitively rejecting the claims of previous authors, we instead hope that our research inspires conversation about the feasibility of pitch type prediction, and as more data becomes publicly available, we hope that our results, as well as those of others, will be subject to peer-review in order to resolve some of the mystery surrounding pitch sequencing. While in this instance, we conclude that pitcher behavior cannot be approximated using statistical models, we remain hopeful that the continued application of machine learning within baseball can change the game for the better. As time passes, we hope that fans, analysts, and players alike can join together to produce meaningful work, furthering the efforts of Major League Baseball to make America's favorite pastime more accessible to all, regardless of the outputs of pitch type prediction models.

References

[1] Abe, Shigeo. "Multiclass Support Vector Machines." Support Vector Machines for Pattern Classification, Advances in Pattern Recognition, 2010, pp. 113–161, doi:10.1007/978-1-84996-098-4_3.

[2] Andoni, Alexandr, et al. "Earth Mover Distance over High-Dimensional Spaces." ECCC, 11 Oct. 2007.

[3] Beaver, Zach, et al. "The Solution." Pitch Prediction.

[4] Bock, Joel. "Pitch Sequence Complexity and Long-Term Pitcher Performance." MDPI, 2 Mar. 2015.

[5] Breiman, Leo. "Random Forests." Machine Learning, vol. 45, no. 1, 2001, pp. 5–32, doi:10.1023/a:1010933404324.

[6] Fast, Mike. "What the Heck Is PITCHf/X?" The Hardball Times Annual, 2010.

[7] Freund, Yoav, and Robert Schapire. "A Short Introduction to Boosting." Journal of Japanese Society for Artificial Intelligence, Sept. 1999, pp. 771–780.

[8] Ganeshapillai, Gartheeban, and John Guttag. "Predicting the Next Pitch." Sloan Sports Analytics Conference, May 2012.

[9] Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction: with 200 Full-Color Illustrations. Springer, 2004.

[10] Hoang, Phong. "Supervised Learning in Baseball Pitch Prediction and Hepatitis C Diagnosis." North Carolina State University, 2015.

[11] Lewis, Michael. Moneyball: The Art of Winning an Unfair Game. Law Press, 2011.

[12] Mika, S., et al. "Fisher Discriminant Analysis with Kernels." Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468), doi:10.1109/nnsp.1999.788121.

[13] Rubner, Y., et al. "A Metric for Distributions with Applications to Image Databases." Sixth International Conference on Computer Vision.

[14] Sidle, Glenn, and Hien Tran. "Using Multi-Class Classification Methods to Predict Baseball Pitch Types." North Carolina State University, 2015.

[15] Smith, David W. "Do Batters Learn During a Game?" Retrosheet, 7 June 1996.

[16] Tibshirani, Robert. "Regression Shrinkage and Selection via the Lasso: A Retrospective." Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 3, 2011, pp. 273–282, doi:10.1111/j.1467-9868.2011.00771.x.

[17] Velmurugan, T. "Computational Complexity between K-Means and K-Medoids Clustering Algorithms for Normal and Uniform Distributions of Data Points." Journal of Computer Science, vol. 6, no. 3, 2010, pp. 363–368, doi:10.3844/jcssp.2010.363.368.

[18] Zhao, Shiyuan, et al. "Measuring Pitcher Similarity." Baseball Prospectus, 10 July 2017.