When to Pull Starting Pitchers in Major League Baseball? A Data Mining Approach

Michael Woodham, Jason Hawkins, Ankita Singh, Shayok Chakraborty Department of Computer Science, Florida State University

ABSTRACT
One of the most important decisions made by managers in a baseball game is when to pull the starting pitcher. It has a direct consequence on the outcome of the game and also on the physical fitness of the pitcher. Traditionally, managers rely on various heuristics for this decision. In this paper, we propose a machine learning based approach to determine when to replace the starting pitcher. We curate a large dataset of more than one million samples, spanning more than 10 years of baseball games (2007 - 2017), and study the performance of various classification algorithms on this dataset. We further perform feature analysis to gain insights on the most important features influencing the replacement of starting pitchers. To the best of our knowledge, this is the first research effort to leverage machine learning and data analytics to model managers' decisions of pulling a starting pitcher from historic data. Such a system can be immensely useful in assisting managers make more informed decisions during an ongoing game and has the potential to reduce the risk of baseball related injuries. We hope that our curated dataset and initial research findings will promote further work toward this important problem of deciding when to pull a starting pitcher in an ongoing baseball game.

CCS CONCEPTS
• Computing methodologies → Supervised learning; • Applied computing;

KEYWORDS
Machine Learning for Baseball, Class Imbalance

ACM Reference Format:
Michael Woodham, Jason Hawkins, Ankita Singh, Shayok Chakraborty. 2019. When to Pull Starting Pitchers in Major League Baseball? A Data Mining Approach. In Applied Data Science Track, KDD 2019. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/1122445.1122456

1 INTRODUCTION
Good pitching is extremely critical in baseball and the pitcher is often considered the most important player on the defensive side of the game (with baseball being the only American sport where the defense is in possession of the ball). Pitchers fall into two categories: (i) the starting pitcher and (ii) relief pitchers, who replace the starting pitcher. The starter is expected to pitch deep into the game and it is common for starting pitchers to throw 100 or more pitches in a game (footnote 1); as per our curated data, the starting pitcher pitched for approximately 6.2 - 6.3 innings on an average. When to replace the starting pitcher is often the most important in-game decision a manager has to make. This is due to the following reasons:
• If a good pitcher is replaced too early, the team may end up losing the game. However, throwing a perfect baseball pitch is a strenuous activity and requires a specific body posture. Depending on the number of pitches thrown in a game, a starting pitcher in professional baseball typically needs a specific duration of mandatory rest before pitching another game. Major League Baseball (MLB) teams play 162 games in a given season. Thus, it is critical to decide the best time to pull out a starting pitcher, so that he is fit to play in subsequent games and can deliver his best performance.
• If the starting pitcher is left in the game for too long, the opponent team can potentially get familiar with his pitching style after multiple at-bats and in-person observations, and as a result, can gain a higher probability to score.
• If a pitcher throws too hard for a long time, a serious elbow or shoulder injury may be on the horizon. Damage or tear to the ulnar collateral ligament (UCL) is the most common injury suffered and is often caused by pitchers throwing too much. This ligament is the main stabilizer of the elbow for the motions of pitching, and can be difficult to repair and rehabilitate if damaged (footnote 2). Other injuries include muscle strains, labral tears, rotator cuff injuries, shoulder instability and thrower's elbow (footnote 3). Of all the MLB pitchers who threw 95 mph or harder on average in 2017, 80% of them appeared on the disabled list at some point during the season (footnote 4). These serious injuries have a debilitating effect on the lives of these athletes and often result in the early termination of their sports careers.
Today, managers rely upon various heuristics of an ongoing game to decide when the starting pitcher should be relieved. In this paper, we use data mining algorithms to derive a model that can learn from historic data and past decisions made by managers regarding the replacement of starting pitchers; such a model can assist managers in making decisions in a much more principled way during an ongoing game. We make a considerable effort in engineering a rich set of features (including information about the current pitcher, batter, venue of the game etc.) for more accurate modeling and prediction of a manager's decision of pulling a starting pitcher.

Footnotes: (1) Starting Pitcher Statistics. (2) UCL Injuries. (3) Other Baseball-related Injuries. (4) MLB 2017.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. Applied Data Science Track Paper, KDD 2019. © 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-9999-9/18/06...$15.00. https://doi.org/10.1145/1122445.1122456
The contributions of this paper are outlined below:
• We curate a large dataset of more than one million samples containing various statistics of baseball games spanning the years 2007 - 2017 and the corresponding manager decisions of whether the starting pitcher was pulled or not (further details are provided in Section 3). This dataset is the first of its kind in terms of the number of samples, the number of extracted features and the number of years covered. It will be made publicly available to foster further research on this topic.
• We present a comparative analysis of a number of classification algorithms on this dataset, where the goal is to predict whether or not to relieve the starting pitcher at a specific point in an ongoing game, given the characteristics (features) of the game. We further perform feature analysis to gain insights on the most important factors influencing the removal of pitchers.

While previous research has focused mostly on predicting the result of a game and evaluating a player's contributions to wins or losses, our work is the first to use data analytics to decide when to replace the starting pitcher. The proposed system can be used by managers to make more informed decisions during an ongoing game, and has the potential to reduce the risk of baseball related injuries. The rest of the paper is organized as follows: we present a review of related work in Section 2; the details of the curated dataset and the features extracted are presented in Section 3; Section 4 details our empirical study and we conclude with discussions in Section 5.

2 RELATED WORK
Baseball analytics using machine learning is an emerging research field [34]. Existing research in this area can be classified into three groups based on the specific variables that are being predicted: (i) pitch related variables; (ii) game related variables; and (iii) player related variables. These are detailed below:
Pitch related variables: These algorithms attempt to predict attributes related to the pitch, such as the speed, type of pitch etc. Ganeshapillai and Guttag [15] used a linear SVM model to predict whether a pitcher's next pitch will be a fastball or not. The pitches from the 2008 season were used for training to predict the pitches of the 2009 season and an accuracy of 70% was achieved. Hoang et al. [25, 26] presented a comparative analysis of three classification algorithms - k-nearest neighbors, SVMs and linear discriminant analysis (LDA) - for the same problem of classifying pitches into the fastball and non-fastball categories. Hamilton et al. [18] further improved the accuracy by introducing feature selection into the prediction problem. Fastball and non-fastball pitches can be further categorized by pitch type; fastball pitches can be subdivided into cutters, two-seam fastballs and four-seam fastballs, while non-fastball pitches can be classified as curve balls, change ups, or sliders. Sidle [39] studied the performance of SVMs, LDA and bagged random forest models to classify pitches. Random forest depicted the best performance in terms of classification accuracy while LDA was more efficient in terms of computation time. Along similar lines, Bock [5] used multinomial logistic regression to classify pitch type, with improved performance results. Attarian et al. [2, 3] studied the performance of k-nearest neighbors and naive Bayes classifiers to classify pitches based on different characteristics like spin rate and velocity of the pitch.
Game related variables: The focus of these methods is mostly to predict the final outcome (win / lose) of a particular game. Valero [44] studied four machine learning algorithms to predict the winner of a baseball game: (i) k-nearest neighbors, (ii) artificial neural networks / Multi-Layer Perceptron, (iii) Decision Trees, and (iv) SVM, and concluded that the SVM approach was the most successful, with prediction accuracy of just under 60%. SVMs were also exploited by Tolbert and Trafalis [42] to predict the winners for the American League Championship Series, the National League Championship Series, and the World Series in the MLB. Everman created his own statistic, referred to as Calculated Aggregate Value (CAV), to predict the winner of a playoff series [10]. Yang and Swartz combined home field advantage with past performance, ability, and starting pitchers in a two-stage Bayesian model to predict the winner of a game [45].
Player related variables: These algorithms attempt to predict player-specific attributes, such as predicting their performance over a particular season, their batting average in a particular season and so on. Ganeshapillai and Guttag [16] studied the problem of predicting a pitcher's performance in the next inning; using multi-task learning, the authors developed pitcher specific prediction models that can be used to estimate whether the team will give up at least one run if the starting pitcher is allowed to start the next inning. This research was focused on predicting the performance of the pitcher in the next inning; in our work, we analyze pitch-by-pitch data to determine the optimal time to replace the starting pitcher. Fichman and Fichman [11] used regression analysis and showed a decline in a player's batting average over his career. The results aligned with common knowledge that athletes tend to perform at a lower level as they age. Similar research has also been performed by Stevens in attempting to model a pitcher's strikeouts and walks as they age [40]. Healey [22, 23] also used regression analysis to assess the probability of a strikeout given a specific matchup between a pitcher and a batter. Tung developed the Offensive Player Grade (OPG) metric to measure a player's offensive performance, while ignoring his defensive statistics [43]. Lyle used SVMs and artificial neural networks to predict several offensive statistics that are used to evaluate a player's offensive prowess, such as runs, RBIs, hits, triples, and doubles [37]. Bayesian analysis has been exploited to build models for pitchers and batters, which could be used to predict their performance for the rest of the season [24, 28]. Jensen et al. [31] used hidden Markov chains and Gibbs sampling to assess the evolution of hitting performance over a player's career. Using MLB data from 2005 as training, Jiang et al. used Bayesian models to predict a player's batting average in the 2006 season [32]. Ishii [29] used the k-means and hierarchical clustering algorithms to determine undervalued players and classified them based on pitch type and repertoire. The objective was to find players whose earned run average (ERA) was higher than their cluster ERA, which represented an average ERA for players of that skill level. Players who fit this criterion were deemed to be undervalued. A similar analysis of identifying undervalued and overvalued players was performed by Barnes and Bjarnadottir [4] using regression trees and gradient boosted trees. Freiman demonstrated the feasibility of using Random Forests to predict a player's election to the Baseball Hall of Fame [12]. Researchers have also addressed the problem of predicting whether a particular player in South Korea will be asked to join the South Korean national baseball team [30]. Das and Das used a neural network to analyze which aspects of a ball in flight contribute most to a fielder's ability to catch it [9]. Their results indicated that towards the end of the ball's flight, ball velocity became more impactful to catch probability than elevation angle, which was still a large contributor.
As evident from this survey, existing research has primarily focused on predicting pitch, game or player related attributes; no work has focused on pitch by pitch analysis to decide the optimal time to replace the starting pitcher. In this paper, we propose a data driven framework to address this important challenge. We describe our data collection and feature engineering process in the next section, followed by the results of our empirical studies.

3 DATASET AND FEATURE ENGINEERING
Since there are no publicly available datasets with the information we desire, we decided to create our own. We implemented a web-scraper using Python and used it to crawl the website baseball-reference.com. This website contains the complete statistics for current and historical baseball players, teams, scores and leaders. The website was crawled to extract information about each individual game played by each team from 2007 to 2017. In total, our database contains 1,365,204 samples, where each sample corresponds to the pitches faced by a particular batter in a particular game, before he moves base. The dataset will be made publicly available upon acceptance of our paper.
Feature engineering refers to the process of using domain knowledge of the data to create features that best represent the underlying problem, resulting in improved model accuracy on unseen data. We extracted a set of 36 informative features to train our classification models, which are reported in Table 1. We also include brief intuitive justifications on the usefulness of each feature in the table.
Label Assignment: The ground truth labels were assigned using the pitcher and batter information available in the website. Each sample was provided with a 0 or 1 label as follows: if the pitcher has made it through a batter and will continue to throw to the next batter, the sample is labeled as 0; if the batter which the pitcher faced is the last batter that the pitcher faces in the game, that is, the pitcher was replaced after that batter, the sample was assigned a label of 1.

4 EXPERIMENTS AND RESULTS
4.1 Handling Class Imbalance
We analyzed the class labels in our data year by year and the results are depicted in Figure 1. It is evident that there is a strong class imbalance in the data; for each year's data, only about 4% - 5% of the samples belong to class 1. To address this issue, we leveraged the synthetic minority oversampling technique (SMOTE) algorithm [7], which has depicted impressive performance to learn from imbalanced datasets. The SMOTE algorithm creates artificial examples of the minority class based on the feature space similarities between existing minority samples. For a subset S_minority of the minority class, the k-nearest neighbors are considered for each sample x_i ∈ S_minority, for some integer k. To create a synthetic sample x_syn, one of the k-nearest neighbors is selected at random and the corresponding feature vector difference is multiplied by a random number between [0, 1] and added to x_i:

x_syn = x_i + (x̂_i − x_i) · ρ    (1)

where x̂_i is one of the k nearest neighbors of x_i and ρ ∈ [0, 1] is a random number.

Figure 1: Bar graph showing the percentage of 0 and 1 labels in the data year by year. Best viewed in color.

4.2 Model Review
In this section, we present the classification models studied in this research. For the sake of reproducibility, we also detail the parameter combination used to train each model. Let {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} be a set of N training samples, where x_i is the feature and y_i is the label of data sample i.
k-nearest neighbors: The nearest neighbor model classifies a test sample based on a majority vote on the labels of its k closest training samples.
Logistic Regression: Logistic regression is a well-known statistical model for probabilistic classification. Given a test sample x, a binary logistic regression models the conditional probability of the class label y ∈ {+1, −1} as:

p(y = ±1 | x, w) = 1 / (1 + exp(−y wᵀx))    (2)

where w is the model parameter. The model parameters are trained by maximizing the log-likelihood of the training data L through a regularized framework:

min_w  Σ_{i∈L} log(1 + exp(−y_i wᵀx_i)) + (λ/2) wᵀw    (3)

Neural Network: The artificial neural network model was inspired by biological processes in which the connectivity between neurons mimics the organization of the human visual cortex. It consists of an input layer, an output layer and multiple hidden layers to capture non-linear relationships.
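The synthetic-sample rule of Eq. (1) can be sketched in a few lines of NumPy. This is a simplified illustration of the sampling step only, not the exact SMOTE implementation used in our experiments; the function and variable names are ours:

```python
import numpy as np

def smote_sample(X_min, i, k=5, rng=None):
    """Minimal sketch of Eq. (1): synthesize one minority-class sample
    x_syn = x_i + (x_hat - x_i) * rho, with rho drawn from U[0, 1]."""
    rng = rng or np.random.default_rng(0)
    x_i = X_min[i]
    dists = np.linalg.norm(X_min - x_i, axis=1)   # distances to all minority points
    neighbors = np.argsort(dists)[1:k + 1]        # k nearest, excluding x_i itself
    x_hat = X_min[rng.choice(neighbors)]          # one neighbor chosen at random
    rho = rng.uniform(0.0, 1.0)
    return x_i + (x_hat - x_i) * rho              # lies on the segment [x_i, x_hat]
```

By construction, every synthetic point lies on the line segment between a minority sample and one of its minority-class neighbors, which is what keeps the oversampled region inside the minority manifold.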

Number | Feature | Definition | Intuition
1 | Strike Count | Total number of strikes a pitcher has thrown | Contributes to overall pitcher accuracy
2 | Ball Count | Total number of balls a pitcher has thrown | Detracts from overall pitcher accuracy
3 | Pitch Count | Total number of pitches thrown | Contributes to endurance
4 | Strike to Ball Ratio | Number of strikes divided by number of balls thrown | Gives us an efficiency score
5 | Outs | Number of current outs | Indicates where in the game we stand
6 | Inning | Current inning | Same as above
7 | Slugging Current | Slugging percentage of the current batter | Gives us information on the quality of the batter
8 | OBP Current | On base percentage of the current batter | Same as above
9 | OPS Current | On base percentage plus slugging of the current batter | Same as above
10 | Slugging Average | Slugging percentage average of the current batting team | Gives us information on what the pitcher has faced throughout the game, and how the current batter differs from these notions
11 | OBP Average | On base percentage average of the current batting team | Same as above
12 | OPS Average | On base percentage plus slugging average of the current batting team | Same as above
13 | Runs | Number of runs given up by the pitcher | Contributes to efficiency
14 | Hits | Number of hits given up by the pitcher | Detracts from overall efficiency
15 | Walks | Number of walks by the current pitcher | Detracts from overall efficiency
16 | Strikeouts | Number of strikeouts by the current pitcher | Contributes to efficiency
17 | Home Runs | Home runs conceded by the current pitcher | Detracts from pitcher morale and efficiency
18 | On-base | Current positions of on-base runners | Gives us the current game scenario
19 | Home / Away | 0 if the team is home, 1 if away | Venue of the game, contributes to rest time
20 | Park League | 0 for National League, 1 for American League | Contributes to game strategy
21 | Batter Contact Percentage (BCP) | Current batter's contact ratio: contact / pitches thrown | Gives us an indication of the batter's ability to put "bat-on-ball"
22 | BCP Average | BCP average over the whole game | Same as above
23 | Game No. | Game number for the pitching team on the season | Indicates in-season scenarios and motivation
24 | Team Losses | Team losses for the pitching team on the season | Same as above
25 | Team Wins | Team wins for the pitching team on the season | Same as above
26 | Team Win Loss Percent | Team wins / game number | Same as above
27 | Slugging Next | The slugging percentage of the next batter, 0 if the game is over | Indication of the quality of the next batter
28 | OBP Next | The OBP of the next batter, 0 if the game is over | Same as above
29 | OPS Next | The OPS of the next batter, 0 if the game is over | Same as above
30 | Friendly Score | Score of the pitching team | Performance of the pitching team
31 | Pitcher's Team | Integer to indicate the team | Indication of ties between pitchers
32 | Opposing Team | Integer value to indicate the opposing team | Indication of ties between pitchers and opposing team
33 | Year | Year the game was played | Gives a context of timeline
34 | Stadium Location | Integer to indicate where the game took place | Context of field indicators
35 | Winning | 1 if the pitching team is winning, 0 if it is tied or losing | Context of the current status of the game
36 | Batter No. | Number of batters the pitcher has faced in the game | Indication of pitcher's efficiency

Table 1: Features Extracted from the Data
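To make the table concrete, a few of these features can be derived from raw in-game counts as follows. This is a hypothetical sketch; the field names and the zero-ball fallback are our illustrative assumptions, not the actual schema of the curated dataset:

```python
def make_features(strikes, balls, outs, inning, runs_allowed, hits_allowed):
    """Derive a handful of the Table 1 features from raw in-game counts.
    Illustrative only: names and the zero-ball fallback are assumptions."""
    pitch_count = strikes + balls                          # feature 3 (Pitch Count)
    # feature 4 (Strike to Ball Ratio); guard against division by zero
    strike_to_ball = strikes / balls if balls else float(strikes)
    return {
        "strike_count": strikes,            # feature 1
        "ball_count": balls,                # feature 2
        "pitch_count": pitch_count,         # feature 3
        "strike_to_ball": strike_to_ball,   # feature 4
        "outs": outs,                       # feature 5
        "inning": inning,                   # feature 6
        "runs": runs_allowed,               # feature 13
        "hits": hits_allowed,               # feature 14
    }
```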

Random Forest: Random forest is an ensemble learning algorithm, which consists of a multitude of decision trees, where each tree is trained on a random subset of the training data through bootstrapping. It predicts the class that is the mode of the classes predicted by the individual trees.
AdaBoost: Adaptive boosting (AdaBoost) fits a sequence of weak learners iteratively on the training data; all the predictions are then combined through a weighted summation to produce the final prediction [19]. A weight is computed for each training sample, which denotes the probability of it being selected for training in the next boosting round (initially, all samples are assigned equal weight). If a training sample is mis-classified by the learner in the current iteration, it will have its weight increased for the next iteration. As the boosting iterations proceed, samples that are difficult to classify will receive increased weightage; the subsequent weak learners will thus be forced to focus on the samples that were incorrectly classified in the previous rounds. Formally, in boosting round j, the error rate of the weak learner C_j is computed as:

ϵ_j = (1/N) Σ_{i=1}^{N} w_i δ(C_j(x_i) ≠ y_i)    (4)

where w_i denotes the weight of the training sample x_i and δ(.) is an indicator function whose value is 1 if the argument is true and 0 otherwise. The importance of the weak learner is then computed as:

α_j = (1/2) log((1 − ϵ_j) / ϵ_j)    (5)

Given the importance, the weights of the training samples for the next round j+1 are computed as (Z is a normalization constant):

w_i^{j+1} = (w_i^j / Z) exp(−α_j)  if C_j(x_i) = y_i;    w_i^{j+1} = (w_i^j / Z) exp(α_j)  if C_j(x_i) ≠ y_i    (6)

For a given test sample x, the prediction made by each classifier is weighted according to its importance to derive the final prediction (T is the number of boosting rounds):

C*(x) = arg max_y Σ_{j=1}^{T} α_j δ(C_j(x) = y)    (7)

GradientBoost: Similar to AdaBoost, GradientBoost [13, 14] combines a set of weak learners into a single strong learner in an iterative manner. The target function is approximated as:

F(x) = Σ_{m=0}^{M} β_m h(x; θ_m)    (8)

where the βs denote the combination weights and h(x; θ) are the base learners with parameters θ = {θ_1, θ_2, ..., θ_k}. The coefficients β and the parameters θ are jointly fit to the training data in a forward stage-wise manner by solving the following optimization problem:

(β_m, θ_m) = arg min_{β,θ} Σ_{i=1}^{N} L(y_i, F_{m−1}(x_i) + β h(x_i; θ))    (9)

and

F_m(x) = F_{m−1}(x) + β_m h(x; θ_m)    (10)

Gradient boosting solves for (β_m, θ_m) for arbitrary loss functions L(y, F(x)) with a two step procedure. The function h(x; θ) is first fit by least squares:

θ_m = arg min_{θ,ρ} Σ_{i=1}^{N} {ŷ_im − ρ h(x_i; θ)}²    (11)

to the current pseudo-residuals:

ŷ_im = ∂L(y_i, F(x_i)) / ∂F(x_i)    (12)

Then, given the current estimate of h(x; θ_m), the optimal value of the coefficient β_m is computed as:

β_m = arg min_β Σ_{i=1}^{N} L(y_i, F_{m−1}(x_i) + β h(x_i; θ_m))    (13)

The problem thus reduces to a least squares optimization, followed by a single parameter optimization based on a general loss criterion L.
XGBoost: eXtreme Gradient Boosting (XGBoost) is an efficient and scalable implementation of the gradient boosting framework [8]. The scalability of XGBoost is due to several important systems and algorithmic optimizations, including a novel tree learning algorithm for handling sparse data and a theoretically justified weighted quantile sketch procedure for handling instance weights in approximate tree learning. Parallel and distributed computing make learning faster, which enables quicker model exploration.
LightGBM: LightGBM is another efficient implementation of gradient based boosting algorithms [33]. It employs two novel strategies to reduce the computational complexity: (i) Gradient-based One-Sided Sampling (GOSS), where only data samples with significant gradient values are retained and others are excluded. Since data samples with large gradient values contribute more to the information gain, GOSS can obtain an accurate estimate of the gain with a much smaller sample size; and (ii) Exclusive Feature Bundling (EFB), which bundles mutually exclusive features to reduce the number of features. While finding the optimal bundling is an NP-hard problem, a greedy algorithm can achieve a good approximation ratio and can thus substantially reduce the number of features, without compromising much on the accuracy.
Deep Neural Networks: Recently, deep neural networks have revolutionized the field of machine learning; these models automatically learn a discriminating set of features from the data and have depicted commendable empirical performance in a variety of applications [21, 35, 41]. A typical deep learning model (like a CNN) contains the following types of layers: (i) an input layer to hold the raw values of the data; (ii) a convolution layer to compute the output of neurons that are connected to local regions in the input, by computing a dot product between their weights and a small region they are connected to in the input; (iii) a ReLU layer, which applies an element-wise activation function, such as max(0, x), thresholding at zero; (iv) a pool layer to perform a down-sampling operation along the spatial dimensions; and (v) a fully-connected layer to compute the class probabilities. Since we are not dealing with images in this research, we did not include any convolution layers and used a deep network with 4 fully connected layers with the rectified linear unit activation function.

4.3 Reproducibility
In this section, we present the detailed parameter settings used to train each of our models, in order to allow reproducibility of our results. The parameter settings are reported in Table 2.
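The weight-update mechanics of Eqs. (4)-(6) can be sketched in NumPy. This is a simplified illustration that takes a weak learner's predictions as given and uses the standard form of the weighted error (assuming the weights sum to 1); it is not the implementation used in our experiments:

```python
import numpy as np

def adaboost_round(w, pred, y):
    """One boosting round: weighted error (Eq. 4), learner importance
    (Eq. 5), and the sample re-weighting of Eq. (6). Assumes sum(w) == 1."""
    miss = pred != y
    eps = float(np.sum(w[miss]))               # weighted error rate of this learner
    alpha = 0.5 * np.log((1.0 - eps) / eps)    # importance of this weak learner
    w_new = w * np.exp(np.where(miss, alpha, -alpha))
    return alpha, w_new / w_new.sum()          # division by Z renormalizes
```

Misclassified samples have their weight multiplied by exp(α), so the next weak learner is forced to focus on them, exactly as described above.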

Model | Parameter Settings
SMOTE | Number of nearest neighbors used to construct synthetic samples: 5; Sampling strategy: equalize the number of samples in the two classes
k-nearest neighbors | Number of nearest neighbors: 9; Distance metric: Euclidean distance
Logistic Regression | Regularization: L2; Inverse of regularization strength C: 1; Tolerance for stopping criteria: 0.0001; Maximum number of iterations for convergence: 100
Neural Network | Number of hidden layers: 1; Number of neurons in hidden layer: 100; Learning rate: 0.001; Number of epochs: 200; Activation function: ReLU; Batch size for training: 200; Momentum: 0.9; Optimizer/solver: Adam
Random Forest | Number of trees: 100; Split criterion: Gini
AdaBoost | Base estimator: decision tree classifier of depth 1; Maximum number of boosting rounds: 50; Learning rate: 1
GradientBoost | Base estimator: decision tree classifier of depth 1; Maximum number of boosting rounds: 100; Learning rate: 0.1
XGBoost | Base estimator: decision tree classifier of maximum depth 3; Maximum number of boosting rounds: 100; Learning rate: 0.1
LightGBM | Base estimator: decision tree classifier of depth 1; Maximum number of boosting rounds: 200; Learning rate: 0.005
Deep Neural Network | Number of hidden layers: 4; Number of neurons in each hidden layer: 256, 128, 128 and 64 respectively; Activation function: ReLU; Optimizer: Adadelta; Learning rate: 0.05

Table 2: Model Parameters used for Training
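For convenience, the settings in Table 2 can be collected into a single configuration mapping. This is a hypothetical sketch: the keyword names mirror common scikit-learn parameter conventions and are our assumption, not an artifact of the actual code base:

```python
# Hypothetical consolidation of Table 2; values copy the table,
# keyword names follow scikit-learn conventions as an assumption.
MODEL_PARAMS = {
    "smote": {"k_neighbors": 5, "sampling_strategy": "auto"},  # equalize classes
    "knn": {"n_neighbors": 9, "metric": "euclidean"},
    "logistic_regression": {"penalty": "l2", "C": 1.0, "tol": 1e-4, "max_iter": 100},
    "neural_network": {"hidden_layer_sizes": (100,), "learning_rate_init": 0.001,
                       "max_iter": 200, "activation": "relu", "batch_size": 200,
                       "momentum": 0.9, "solver": "adam"},
    "random_forest": {"n_estimators": 100, "criterion": "gini"},
    "adaboost": {"base_max_depth": 1, "n_estimators": 50, "learning_rate": 1.0},
    "gradient_boost": {"base_max_depth": 1, "n_estimators": 100, "learning_rate": 0.1},
    "xgboost": {"max_depth": 3, "n_estimators": 100, "learning_rate": 0.1},
    "lightgbm": {"max_depth": 1, "n_estimators": 200, "learning_rate": 0.005},
    "deep_nn": {"hidden_layers": (256, 128, 128, 64), "activation": "relu",
                "optimizer": "adadelta", "learning_rate": 0.05},
}
```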

Model | TP | FP | TN | FN | Accuracy (%) | F-score (%)
k-nearest neighbors | 10,752 | 47,893 | 303,049 | 3,510 | 85.92 | 29.49
Logistic Regression | 12,836 | 54,004 | 296,938 | 1,426 | 84.82 | 31.65
Neural Network | 12,903 | 56,815 | 294,127 | 1,359 | 84.07 | 30.72
Random Forest | 6,217 | 3,056 | 347,886 | 8,045 | 96.96 | 52.83
AdaBoost | 8,587 | 13,486 | 337,456 | 5,675 | 94.75 | 47.26
GradientBoost | 6,928 | 4,688 | 346,254 | 7,334 | 96.70 | 53.54
XGBoost | 11,180 | 30,146 | 320,811 | 3,067 | 90.90 | 40.23
Light GBM | 9,413 | 12,573 | 338,384 | 4,834 | 95.23 | 51.95
Deep Neural Network | 7,772 | 8,095 | 342,862 | 6,475 | 96.01 | 51.61

Table 3: Performance analysis. TP: True Positive, FP: False Positive, TN: True Negative, FN: False Negative. Accuracy and F-score are denoted in percentages. The best results are shown in bold.

4.4 Results number of boosting iterations etc.) can potentially further improve We randomly selected about 75% of the data for training (1 million the classification performance and is a part of our ongoing and samples) and the remaining 25% for testing (365, 204 samples). The future research. The deep neural network also depicts impressive algorithms were implemented in Python 2.7.12 on a laptop running performance and and achieves an F-score of 51.61%. Ubuntu 16.04, equipped with an Intel(R) Core(TM) i7-6500U proces- We also conducted a study where we split the test samples year- sor @ 2.50 GHz. Since we are dealing with class imbalanced data, by-year, in order to assess the consistency of the model on test data we used both accuracy and F-score as our evaluation metrics in this spanning the entire time period considered in this research. We research. The results are depicted in Table 3. The nearest neighbor, used the random forest model for this analysis, due to its promising logistic regression and neural networks perform poorly on this performance (Table 3). The accuracy and the F-score results are data; they incur a high number of false positives and thus produce shown in Figures 2 and 3 respectively. We note that the model con- low F-scores (around 30%). The bagging and boosting-based meth- sistently depicts promising results (similar to that obtained in Table ods depict much better performance. GradientBoost produces the 3) across the entire time period of 11 years, further corroborating highest F-score (53.54%), while random forest produces the highest its usefulness. accuracy, and an F-score very close to the highest (52.83%). Tun- To gain further insights on our features, we ranked them accord- ing the parameters of the bagging and boosting based methods ing to their relative importance. In a random forest, the importance (the number of weak learners, parameters of the weak learners, When to Pull Starting Pitchers in Major League Baseball? 
A Data Mining Approach Applied Data Science Track Paper, KDD 2019

Prediction Accuracy: Year by Year 100

80

60

40 Accuracy (%) Accuracy

20

0 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 Year

Figure 2: Bar graph showing accuracy on test samples, year by year.

[Figure 3 here: bar graph titled "Prediction F-Score: Year by Year"; x-axis: Year (2007-2017), y-axis: F-Score (%).]

Figure 3: Bar graph showing F-score on test samples, year by year.

Figure 4: Bar graph showing relative feature importance using the Random Forest classifier. Feature numbers correspond to the ordering in Table 1.

In a random forest, the importance of a particular feature is computed by the mean decrease of impurity [6], defined as the total decrease in node impurity (weighted by the probability of reaching that node), averaged over all trees of the ensemble. The relative importance of each of the 36 features is depicted in Figure 4 (the feature numbers correspond to the ordering presented in Table 1). It is evident that the batter number and the strike count are the most important features in predicting when to pull the starting pitcher. The pitch count, the number of current outs, the inning, and the home runs conceded by the current pitcher also have a significant impact on the final prediction.

We further analyzed the effectiveness of the top two features (batter number and strike count) in predicting when to replace a starting pitcher. The 3D plot in Figure 5(a) represents the total number of pitchers pulled (actual and predicted) against the number of batters a pitcher has faced when he is pulled; this information is presented inning by inning. The corresponding 2D plot in Figure 5(b) denotes the average number of batters a pitcher has faced when he is pulled, again presented inning by inning. Both these plots were generated by aggregating the test samples whose actual (or predicted) labels were 1, that is, the pitchers were actually (or predicted to be) pulled, and computing the corresponding number of batters faced, inning by inning. Figures 5(c) and 5(d) show the analogous 2D and 3D plots for the strike count feature.

We note that both these features depict relatively poor performance early on in the game (when the inning number is low); however, their performance improves substantially in the later part of the game. Further, as evident from the 2D graphs, at some points in the earlier part of the game there are breaks in the prediction curves; this means that the model does not predict that a pitcher was pulled in that inning. Figure 6 depicts a plot of the total number of pitchers actually pulled, inning by inning (this plot was obtained by aggregating the ground truth information of the pitchers being pulled across the entire training set). It is evident that replacing pitchers early in the game (before the fourth inning) is rare, and we do not have sufficient training samples in our dataset to capture this event, which explains the poor performance of the model in the first few innings.

Figure 6: Ground truth values of the number of pitchers pulled, inning by inning.
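The mean-decrease-of-impurity ranking described above is what scikit-learn exposes as `feature_importances_`. A minimal sketch on synthetic stand-in data follows; the planted signal on features 0 and 5 is an illustration, not a property of our dataset.

```python
# Ranking features by mean decrease of impurity (Gini importance), as
# averaged over all trees of a random forest. Synthetic stand-in data:
# only features 0 and 5 carry signal, so they should rank at the top.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 36))
y = (X[:, 0] + X[:, 5] + rng.normal(size=1000) > 1).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# feature_importances_ sums to 1; sort from most to least important
order = np.argsort(forest.feature_importances_)[::-1]
for i in order[:5]:
    print("feature %2d  importance %.3f" % (i, forest.feature_importances_[i]))
```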

(a) Total number of pitchers pulled against the number of batters faced, inning by inning. (b) Average number of batters a pitcher has faced when he is pulled, inning by inning.

(c) Total number of pitchers pulled against the number of strikes, inning by inning. (d) Average number of strikes thrown by a pitcher when he is pulled, inning by inning.

Figure 5: Performance of the top two features (using the Random Forest model) in predicting when to pull the starting pitcher. Best viewed in color.
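The inning-by-inning aggregation behind Figures 5 and 6 can be sketched with pandas; the column names ("inning", "pulled") and the toy rows below are hypothetical stand-ins for the labels in our curated dataset.

```python
# Aggregating pulled-pitcher events inning by inning, as in Figure 6.
# Each row is one event; pulled == 1 means the starter was replaced there.
# Column names and values are illustrative placeholders.
import pandas as pd

df = pd.DataFrame({
    "inning": [3, 5, 6, 6, 7, 7, 7, 8],
    "pulled": [0, 0, 1, 0, 1, 1, 0, 1],
})

# ground-truth number of pitchers pulled in each inning
pulls_per_inning = df.groupby("inning")["pulled"].sum()
print(pulls_per_inning)
```

The same groupby, restricted to rows with an actual or predicted label of 1, yields the per-inning averages plotted in Figures 5(b) and 5(d).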

5 CONCLUSION AND FUTURE WORK
In this paper, we presented our initial research findings in developing a data-driven framework to predict whether or not to relieve the starting pitcher at a given stage during an ongoing baseball game. Replacing the starting pitcher is a decision the manager has to make, and it has a direct consequence on the result of the game and also on the physical health of the pitcher. Our empirical results show tremendous promise in using machine learning algorithms to predict the optimal time of replacing a starting pitcher. Through our feature analysis experiments, we attempted to shed light on the importance of individual predictors in making the final decision. To the best of our knowledge, this is the first research effort to use a data-driven approach to model managers' decisions on when to replace a starting pitcher. Such a system can potentially be a tool to help managers make informed decisions during an ongoing game, as they will be equipped with the knowledge of other managers' decisions in a variety of situations over an 11-year aggregated time period. We hope that our curated dataset and research findings will promote further research on this important topic.

As part of our ongoing research, we are exploring strategies to further improve the performance of our prediction system. Information fusion from multiple sources has depicted promising performance in a variety of machine learning applications [1]. Since bagging- and boosting-based models produced the best results in our experiments (Table 3), we selected Random Forest, AdaBoost and LightGBM to form a committee of classifiers; a simple majority voting scheme was used to fuse the predictions from the three models. This method produced an accuracy of 96.52% and an F-score of 58.67%. More advanced fusion techniques (such as linear weighted fusion, classification-based fusion and maximum entropy model based fusion, among others [1]) can potentially further improve the prediction performance.

Recursive Feature Elimination (RFE) is a feature selection algorithm which repeatedly constructs a model and removes features with low weights. The model is first trained on the complete set of features, and the features are ranked according to their importance to the model. The least important features are then pruned, and the process is repeated recursively until the desired number of features is obtained. RFE has depicted impressive performance in a variety of applications [27, 46]. In our ongoing work, we are studying the prediction performance of RFE with Random Forest; our initial efforts in this direction have shown promising results.

As part of future research, we plan to develop models that are specific to individual pitchers, rather than a league-wide comprehensive model. That can potentially improve the performance of our system, as it can capture the nuances of individual pitchers. We further plan to exploit domain adaptation algorithms to develop a model for a particular pitcher by leveraging data from other pitchers. Domain adaptation or transfer learning algorithms transfer relevant knowledge from a source domain to develop a model for a (related) target domain [38] and have depicted commendable performance in a variety of applications [17, 36]. Further, while SMOTE has depicted impressive performance for learning in the presence of skewed class distributions, other extensions have been proposed, including SMOTEBoost (a combination of SMOTE and AdaBoost, where synthetic samples are generated in each boosting round) and kernel-based methods [20]. We will explore these advanced algorithms in our future research and study their effects on predictive performance.

REFERENCES
[1] P. Atrey, M. Hossain, A. El-Saddik, and M. Kankanhalli. 2010. Multimodal Fusion for Multimedia Analysis: A Survey. Multimedia Systems 16, 6 (2010), 345–379.
[2] A. Attarian, G. Danis, J. Gronsbell, G. Iervolino, L. Layne, D. Padgett, and H. Tran. 2014. Baseball Pitch Classification: A Bayesian Method and Dimension Reduction Investigation. In IAENG Transactions on Engineering Sciences: Special Issue of the International MultiConference of Engineers and Computer Scientists.
[3] A. Attarian, G. Danis, J. Gronsbell, G. Iervolino, and H. Tran. 2013. A Comparison of Feature Selection and Classification Algorithms in Identifying Baseball Pitches. In International MultiConference of Engineers and Computer Scientists.
[4] S. Barnes and M. Bjarnadóttir. 2016. Great Expectations: An Analysis of Major League Baseball Free Agent Performance. Statistical Analysis and Data Mining 9, 5 (2016), 295–309.
[5] J. Bock. 2015. Pitch Sequence Complexity and Long-Term Pitcher Performance. Sports 3, 1 (2015), 40–55.
[6] L. Breiman, J. Friedman, C. Stone, and R. Olshen. 1984. Classification and Regression Trees. Taylor and Francis.
[7] N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research (JAIR) 16, 1 (2002), 321–357.
[8] T. Chen and C. Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In ACM Conference on Knowledge Discovery and Data Mining (KDD).
[9] R. Das and S. Das. 1994. Catching a Baseball: A Reinforcement Learning Perspective using a Neural Network. In Association for the Advancement of Artificial Intelligence (AAAI).
[10] B. Everman. 2015. Analyzing Using Data Mining. Accessed January 9, 2018.
[11] M. Fichman and M. Fichman. 2012. From Darwin to the Diamond: How Baseball and Billy Beane Arrived at Moneyball. SSRN Electronic Journal (2012).
[12] M. Freiman. 2010. Using Random Forests and Simulated Annealing to Predict Probabilities of Election to the Baseball Hall of Fame. Journal of Quantitative Analysis in Sports 6, 2 (2010), 1–35.
[13] J. Friedman. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29, 5 (2001), 1189–1232.
[14] J. Friedman. 2002. Stochastic Gradient Boosting. Computational Statistics and Data Analysis 38, 4 (2002), 367–378.
[15] G. Ganeshapillai and J. Guttag. 2012. Predicting the Next Pitch. In Sloan Sports Analytics Conference.
[16] G. Ganeshapillai and J. Guttag. 2014. A Data-driven Method for In-game Decision Making in MLB. In Sloan Sports Analytics Conference.
[17] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. 2016. Domain-Adversarial Training of Neural Networks. Journal of Machine Learning Research (JMLR) 17 (2016).
[18] M. Hamilton, P. Hoang, L. Layne, J. Murray, D. Padget, C. Stafford, and H. Tran. 2014. Applying Machine Learning Techniques to Baseball Pitch Prediction. In International Conference on Pattern Recognition Applications and Methods.
[19] T. Hastie, R. Tibshirani, and J. Friedman. 2001. The Elements of Statistical Learning. Springer New York Inc.
[20] H. He and E. Garcia. 2009. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering (TKDE) 21, 9 (2009), 1263–1284.
[21] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[22] G. Healey. 2015. Modeling the Probability of a Strikeout for a Batter/Pitcher Matchup. IEEE Transactions on Knowledge and Data Engineering (TKDE) 27, 9 (2015), 2415–2423.
[23] G. Healey. 2017. Matchup Models for the Probability of a Ground Ball and a Ground Ball Hit. Journal of Sports Analytics 3, 1 (2017), 21–35.
[24] D. Herrlin. 2015. Forecasting MLB Performance Utilizing a Bayesian Approach in Order to Optimize a Fantasy Baseball Draft. PhD thesis, San Diego State University.
[25] P. Hoang. 2015. Supervised Learning in Baseball Pitch Prediction and Hepatitis C Diagnosis. PhD thesis, North Carolina State University.
[26] P. Hoang, M. Hamilton, J. Murray, C. Stafford, and H. Tran. 2015. A Dynamic Feature Selection Based LDA Approach to Baseball Pitch Prediction. In Trends and Applications in Knowledge Discovery and Data Mining.
[27] X. Huang, L. Zhang, B. Wang, F. Li, and Z. Zhang. 2018. Feature Clustering Based Support Vector Machine Recursive Feature Elimination for Gene Selection. Applied Intelligence 48, 3 (2018), 594–607.
[28] S. Huddleston. 2012. Hitters vs. Pitchers: A Comparison of Fantasy Baseball Player Performances Using Hierarchical Bayesian Models. PhD thesis, Brigham Young University-Provo.
[29] T. Ishii. 2016. Using Machine Learning Algorithms to Identify Undervalued Baseball Players. Technical Report, Stanford University.
[30] W. Jang, A. Nasridinov, and Y. Park. 2014. Analyzing and Predicting Patterns in Baseball Data using Machine Learning Techniques. In Advanced Science and Technology Letters.
[31] S. Jensen, B. McShane, and A. Wyner. 2009. Hierarchical Bayesian Modeling of Hitting Performance in Baseball. Bayesian Analysis 4, 4 (2009), 631–652.
[32] W. Jiang and C. Zhang. 2010. Empirical Bayes In-season Prediction of Baseball Batting Averages. Institute of Mathematical Statistics Collections 6 (2010), 263–273.
[33] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems (NIPS).
[34] K. Koseler and M. Stephan. 2017. Machine Learning Applications in Baseball: A Systematic Literature Review. Applied Artificial Intelligence 31, 9-10 (2017), 745–763.
[35] A. Krizhevsky, I. Sutskever, and G. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Neural Information Processing Systems (NIPS).
[36] M. Long, Y. Cao, J. Wang, and M. Jordan. 2015. Learning Transferable Features with Deep Adaptation Networks. In International Conference on Machine Learning (ICML).
[37] A. Lyle. 2007. Baseball Prediction using Ensemble Learning. PhD thesis, University of Georgia.
[38] S. Pan and Q. Yang. 2010. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering (TKDE) 22, 10 (2010).
[39] G. Sidle. 2017. Using Multi-Class Machine Learning Methods to Predict Major League Baseball Pitches. PhD thesis, North Carolina State University.
[40] G. Stevens. 2013. Bayesian Statistics and Baseball. PhD thesis, Pomona College.
[41] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going Deeper with Convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[42] B. Tolbert and T. Trafalis. 2016. Predicting Major League Baseball Championship Winners through Data Mining. Athens Journal of Sports 3, 4 (2016), 239–252.
[43] D. Tung. 2012. Data Mining Career Batting Performances in Baseball. Last accessed January 9, 2018: http://vixra.org/pdf/1205.0104v1.pdf.
[44] C. Soto Valero. 2016. Predicting Win-Loss Outcomes in MLB Regular Season Games: A Comparative Study using Data Mining Methods. International Journal of Computer Science in Sport 15, 2 (2016), 91–112.
[45] T. Yang and T. Swartz. 2004. A Two-stage Bayesian Model for Predicting Winners in Major League Baseball. Journal of Data Science 2, 1 (2004), 61–73.
[46] L. Zhang, F. Peng, L. Qin, and M. Long. 2018. Face Spoofing Detection Based on Color Texture Markov Feature and Support Vector Machine Recursive Feature Elimination. Journal of Visual Communication and Image Representation 51 (2018), 56–69.
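The majority-voting committee described in Section 5 can be sketched with scikit-learn's VotingClassifier. Here GradientBoostingClassifier stands in for LightGBM so that the sketch needs only scikit-learn, and the data is again a synthetic placeholder rather than our curated dataset.

```python
# Committee of three ensemble classifiers fused by a simple majority
# ("hard") vote, mirroring the fusion scheme described in Section 5.
# GradientBoostingClassifier substitutes for LightGBM in this sketch.
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1500, 36))
y = (X[:, 0] - X[:, 1] + rng.normal(size=1500) > 1).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=2)

committee = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=2)),
                ("ada", AdaBoostClassifier(random_state=2)),
                ("gb", GradientBoostingClassifier(random_state=2))],
    voting="hard",  # each model casts one vote; majority wins
).fit(X_tr, y_tr)

pred = committee.predict(X_te)
print("committee F1: %.3f" % f1_score(y_te, pred, zero_division=0))
```

Weighted or probability-based ("soft") voting is the natural next step toward the more advanced fusion techniques mentioned above.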