A STATISTICAL ANALYSIS OF PITCH F/X DATA
Author: Jared Graham, Advisor: Dr. Erik Insko
Florida Gulf Coast University, College of Arts and Sciences

ABSTRACT

My research analyzes statistics to determine whether there is evidence of a correlation between the outcome of a pitch and the characteristics of the pitch. I use multiple linear regression to conduct my research. The data is recorded with Pitch f/x stats technology. The pitching characteristics considered include velocity, spin rate, rotation, and pitch type, as well as the frequency of each type of pitch. Finding a correlation between these variables and the outcome of a pitch (ball, strike, hit) could indicate which aspects of a pitch are significant.

INTRODUCTION

Baseball is one of the most statistics-driven sports in the world. Baseball stats can give us insight into various players and their talents. This allows individuals to make informed decisions about player performance, such as scouting, career longevity, and pitcher placement. My research looks at Pitch f/x data from a tracking system that logs the data for every pitch thrown and compares this information to previous times a player pitched. It is important to review pitching statistics since the pitcher is the player who controls the cadence of the game. I have chosen to use Pitch f/x data from the 2016 World Series. I will be analyzing select pitchers from the Indians and the Cubs.

PITCHERS ANALYZED

Indians Pitchers:
• Cody Allen
• Bryan Shaw

Cubs Pitchers:

PITCH F/X VARIABLES

• Release Point: A pitcher must begin his throwing motion while standing on the pitching mound, which is approximately 60 feet from home plate. The release point is the position of the ball when the pitcher releases it, and it varies from pitcher to pitcher.
• Velocity: the speed and direction of the pitch, in feet per second, measured at the point of release.
• Pitch Type: the characterization of the pitch based on velocity, trajectory, movement, hand position, wrist position and/or arm angle.
• Break Angle: the angle, in degrees, from vertical of the straight-line path from the release point to where the pitch crossed the front of home plate.
• Movement: horizontal and vertical movement of the pitch from the release point to home plate. Other factors of movement include Spin Rate (RPM) and Rotation (0-360 degrees).

STATISTICS TERMINOLOGY

1. The Multiple Linear Regression Model can be written as

\[
Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p,
\]

where $Y$ is the expected value of the dependent variable, $X_1$ through $X_p$ are $p$ distinct predictor variables, $b_0$ is the intercept, and $b_1$ through $b_p$ are the estimated regression coefficients.

2. Akaike's Information Criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models. Hence, AIC provides a means for model selection. It is defined as

\[
\mathrm{AIC}_p = n \ln\left(\frac{\mathrm{SSE}_p}{n}\right) + 2(p + 1),
\]

where $n$ is the number of observations and $\mathrm{SSE}_p$ is the error sum of squares of the model with $p$ predictors.

MULTIPLE LINEAR REGRESSION MODELS

Model A:

\[
\begin{aligned}
\text{Pitch Outcome} ={}& 4.799 \times 10^{-1} + 4.029 \times 10^{-8}(\text{Pitcher ID}) - 3.751 \times 10^{-1}(\text{Handedness}) \\
&- 1.013 \times 10^{-2}(\text{Pitch Type}) - 3.531 \times 10^{-2}(\text{Start Speed}) + 6.425 \times 10^{-2}(\text{End Speed}) \\
&- 3.138 \times 10^{-5}(\text{Spin Rate}) + 3.682 \times 10^{-4}(\text{Spin Direction}) - 8.264 \times 10^{-3}(\text{X Position}) \\
&- 7.138 \times 10^{-2}(\text{Z Position}) + 4.171 \times 10^{-4}(\text{Break Angle}) - 2.929 \times 10^{-2}(\text{X Velocity}) \\
&+ 2.566 \times 10^{-2}(\text{Z Velocity}) - 5.180 \times 10^{-4}(\text{Nasty})
\end{aligned}
\]

Model B:

\[
\begin{aligned}
\text{Pitch Outcome} ={}& 5.264 \times 10^{-2} - 4.123 \times 10^{-2}(\text{Start Speed}) + 7.024 \times 10^{-2}(\text{End Speed}) \\
&- 6.088 \times 10^{-2}(\text{X Position}) - 9.506 \times 10^{-2}(\text{Z Position}) + 3.880 \times 10^{-2}(\text{Z Velocity})
\end{aligned}
\]

Model C:

\[
\begin{aligned}
\text{Pitch Outcome} ={}& 4.344 \times 10^{-2} + 2.653 \times 10^{-2}(\text{End Speed}) - 6.248 \times 10^{-2}(\text{X Position}) \\
&- 1.097 \times 10^{-1}(\text{Z Position}) + 4.247 \times 10^{-2}(\text{Z Velocity})
\end{aligned}
\]

METHODOLOGY

To begin, I cleansed the data from Pitch f/x so it could be input into the linear model function in R. Model A is the original model with the largest number of variables. Once I created a model, I could analyze the significance of each variable with the summary function. Model B includes only the statistically significant variables. Next, I tested for multicollinearity among the variables and chose the ones that were not dependent on each other, so that each variable independently affects the outcome of the pitch. This is how I created Model C. To verify that Model C was the best-fitting model, I calculated the AIC for all three models. Model C has the lowest AIC of the three.

Model   AIC
A       3221.01
B       3214.40
C       3092.49

CONCLUSION

The four variables that significantly affect the outcome of a pitch are End Speed, X Position, Z Position and Z Velocity, found in Model C. This seems intuitive since the faster the ball travels, the more difficult it is to hit. The position of the ball as it reaches home plate determines whether or not the ball is in the strike zone. Since the coefficients for X Position (horizontal orientation) and Z Position (vertical orientation) are negative, the further from the center of the plate the pitch is, the more likely the outcome will be a ball (1). End Speed and Z Velocity both have positive coefficients, so the larger the input, the more likely the pitch will result in a strike (2). If a pitcher can control these aspects of his pitch, he can control the outcome.

REFERENCES

[1] Bovas Abraham and Johannes Ledolter. Introduction to Regression Modeling. Brooks/Cole, Cengage Learning, 2006.
[2] Mike Fast. Glossary of the gameday pitch fields. August 2007.
[3] MLB Data Source. http://gd2.mlb.com/components/game/mlb/year_2016/pitchers/. Accessed: 2019-02-25.

ACKNOWLEDGEMENTS

I would like to thank my advisor Erik Insko for helping me with this research. I would also like to thank Anna Eggleston for her help in guiding me through the R-related aspects of my research.
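
As an illustration of the AIC comparison described in the Methodology, the following is a minimal Python sketch. It is not the R code used for this research. It encodes the AIC definition from the Statistics Terminology section, the Model C coefficients, and the three reported AIC values. The function names (`aic`, `predict_model_c`, `best_model`) and all inputs in the demo at the bottom are hypothetical and chosen only for illustration.

```python
import math

# Model C coefficients as reported on the poster.
MODEL_C = {
    "intercept": 4.344e-2,
    "end_speed": 2.653e-2,
    "x_position": -6.248e-2,
    "z_position": -1.097e-1,
    "z_velocity": 4.247e-2,
}

# AIC values as reported on the poster for Models A, B and C.
REPORTED_AIC = {"A": 3221.01, "B": 3214.40, "C": 3092.49}


def aic(sse, n, p):
    """AIC_p = n * ln(SSE_p / n) + 2 * (p + 1).

    sse: error sum of squares of the fitted model
    n:   number of observations
    p:   number of predictor variables (the intercept is the "+ 1")
    """
    if n <= 0 or sse <= 0:
        raise ValueError("n and sse must be positive")
    return n * math.log(sse / n) + 2 * (p + 1)


def predict_model_c(end_speed, x_position, z_position, z_velocity):
    """Evaluate the Model C regression equation for one pitch."""
    return (
        MODEL_C["intercept"]
        + MODEL_C["end_speed"] * end_speed
        + MODEL_C["x_position"] * x_position
        + MODEL_C["z_position"] * z_position
        + MODEL_C["z_velocity"] * z_velocity
    )


def best_model(aic_by_model):
    """Return the label of the model with the lowest AIC."""
    return min(aic_by_model, key=aic_by_model.get)


if __name__ == "__main__":
    # Model selection using the AIC values reported on the poster.
    print("Lowest AIC:", best_model(REPORTED_AIC))

    # Hypothetical SSE and sample size, for illustration only.
    print("Example AIC:", aic(sse=250.0, n=1000, p=4))

    # Hypothetical pitch measurements, for illustration only.
    print("Example prediction:", predict_model_c(85.0, 0.5, 2.5, -5.0))
```

Because AIC is a relative measure, only the ordering of the three values matters; the sketch simply selects the smallest one.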