Predicting the Potential of Professional Soccer Players
Recommended publications
-
Machine Learning Applications in Baseball: a Systematic Literature Review
This is an Accepted Manuscript of an article published by Taylor & Francis in Applied Artificial Intelligence on February 26 2018, available online: https://doi.org/10.1080/08839514.2018.1442991 Machine Learning Applications in Baseball: A Systematic Literature Review Kaan Koseler ([email protected]) and Matthew Stephan* ([email protected]) Miami University Department of Computer Science and Software Engineering 205 Benton Hall 510 E. High St. Oxford, OH 45056 Abstract Statistical analysis of baseball has long been popular, albeit only in limited capacity until relatively recently. In particular, analysts can now apply machine learning algorithms to large baseball data sets to derive meaningful insights into player and team performance. In the interest of stimulating new research and serving as a go-to resource for academic and industrial analysts, we perform a systematic literature review of machine learning applications in baseball analytics. The approaches employed in literature fall mainly under three problem class umbrellas: Regression, Binary Classification, and Multiclass Classification. We categorize these approaches, provide our insights on possible future applications, and conclude with a summary of our findings. We find two algorithms dominate the literature: 1) Support Vector Machines for classification problems and 2) k-Nearest Neighbors for both classification and Regression problems. We postulate that recent proliferation of neural networks in general machine learning research will soon carry over into baseball analytics. keywords: baseball, machine learning, systematic literature review, classification, regression 1 Introduction Baseball analytics has experienced tremendous growth in the past two decades. Often referred to as “sabermetrics”, a term popularized by Bill James, it has become a critical part of professional baseball leagues worldwide (Costa, Huber, and Saccoman 2007; James 1987). -
A Giant Whiff: Why the New CBA Fails Baseball's Smartest Small Market Franchises
DePaul Journal of Sports Law Volume 4 Issue 1 Summer 2007: Symposium - Regulation of Coaches' and Athletes' Behavior and Related Article 3 Contemporary Considerations A Giant Whiff: Why the New CBA Fails Baseball's Smartest Small Market Franchises Jon Berkon Follow this and additional works at: https://via.library.depaul.edu/jslcp Recommended Citation Jon Berkon, A Giant Whiff: Why the New CBA Fails Baseball's Smartest Small Market Franchises, 4 DePaul J. Sports L. & Contemp. Probs. 9 (2007) Available at: https://via.library.depaul.edu/jslcp/vol4/iss1/3 This Notes and Comments is brought to you for free and open access by the College of Law at Via Sapientiae. It has been accepted for inclusion in DePaul Journal of Sports Law by an authorized editor of Via Sapientiae. For more information, please contact [email protected]. A GIANT WHIFF: WHY THE NEW CBA FAILS BASEBALL'S SMARTEST SMALL MARKET FRANCHISES INTRODUCTION Just before Game 3 of the World Series, viewers saw something entirely unexpected. No, it wasn't the sight of the Cardinals and Tigers playing baseball in late October. Instead, it was Commissioner Bud Selig and Donald Fehr, the head of Major League Baseball Players' Association (MLBPA), gleefully announcing a new Collective Bargaining Agreement (CBA), thereby guaranteeing labor peace through 2011.1 The deal was struck a full two months before the 2002 CBA had expired, an occurrence once thought as likely as George Bush and Nancy Pelosi campaigning for each other in an election year.2 Baseball insiders attributed the deal to the sport's economic health. -
Major Qualifying Project: Advanced Baseball Statistics
Major Qualifying Project: Advanced Baseball Statistics Matthew Boros, Elijah Ellis, Leah Mitchell Advisors: Jon Abraham and Barry Posterro April 30, 2020
Contents
1 Background
1.1 The History of Baseball
1.2 Key Historical Figures
1.2.1 Jerome Holtzman
1.2.2 Bill James
1.2.3 Nate Silver
1.2.4 Joe Peta
1.3 Explanation of Baseball Statistics
1.3.1 Save
1.3.2 OBP, SLG, ISO
1.3.3 Earned Run Estimators
1.3.4 Probability Based Statistics
1.3.5 wOBA
1.3.6 WAR
1.3.7 Projection Systems
2 Aggregated Baseball Database
2.1 Data Sources
2.1.1 Retrosheet
2.1.2 MLB.com
2.1.3 PECOTA
2.1.4 CBS Sports
2.2 Table Structure
2.2.1 Game Logs
2.2.2 Play-by-Play
2.2.3 Starting Lineups
2.2.4 Team Schedules
2.2.5 General Team Information
2.2.6 Player - Game Participation
2.2.7 Roster by Game
2.2.8 Seasonal Rosters
2.2.9 General Team Statistics
2.2.10 Player and Team Specific Statistics Tables
2.2.11 PECOTA Batting and Pitching
2.2.12 Game State Counts by Year
2.2.13 Game State Counts
2.3 Conclusion
3 Cluster Luck
3.1 Quantifying Cluster Luck
3.2 Circumventing Cluster Luck with Total Bases -
Regimes in Baseball Players' Career Data
Data Min Knowl Disc (2017) 31:1580–1621 DOI 10.1007/s10618-017-0510-5 Regimes in baseball players’ career data Marcus Bendtsen1 Received: 28 February 2016 / Accepted: 12 May 2017 / Published online: 27 May 2017 © The Author(s) 2017. This article is an open access publication Abstract In this paper we investigate how we can use gated Bayesian networks, a type of probabilistic graphical model, to represent regimes in baseball players’ career data. We find that baseball players do indeed go through different regimes throughout their career, where each regime can be associated with a certain level of performance. We show that some of the transitions between regimes happen in conjunction with major events in the players’ career, such as being traded or injured, but that some transitions cannot be explained by such events. The resulting model is a tool for managers and coaches that can be used to identify where transitions have occurred, as well as an online monitoring tool to detect which regime the player currently is in. 1 Introduction Baseball has been the de facto national sport of the United States since the late nineteenth century, and its popularity has spread to Central and South America as well as the Caribbean and East Asia. It has inspired numerous movies and other works of art, and has also influenced the English language with expressions such as “hitting it out of the park” and “covering one’s bases”, as well as being the origin of the concept of a rain-check. As we all are aware, the omnipresent baseball cap is worn across the world. -
Predicting Baseball Win/Loss Records from Player Projections
Predicting Baseball Win/Loss Records from Player Projections Connor Daly [email protected] November 29, 2017 1 Introduction When forecasting future results in major league baseball (MLB), there are essentially two sources from which you can derive your predictions: teams and players. How do players perform individually, and how do their collaborative actions coalesce to form a team's results? Several methods of both types exist, but often are shrouded with proprietary formulas. Currently, several mature and highly sophisticated player projection systems are used to forecast season results. None are abundantly transparent about their methodology. Here I set out to develop the simplest possible player-based team projection system and try to add one basic improvement. 2 Predicting Wins and Losses 2.1 Team-Based Projections One approach to such forecasting is to analyze team performance in head to head matchups. A common implementation of this approach is known as an Elo rating and prediction system. Elo systems start by assigning teams an average rating. After games are played, winning teams' ratings increase and losing teams' ratings decrease relative to the expected outcome of the matchup. Expected outcomes are determined by the difference in rating between the two teams. If a very good team almost loses to a really bad team, its rating will only increase slightly. If an underdog pulls off an upset, however, it will earn relatively more points. As more games are played, older games become progressively less meaningful. Essentially, this prediction method considers only who you played, what the margin of victory was, and where the match was played (home team advantage is adjusted for). -
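The Elo update described in the excerpt above can be sketched in a few lines. This is a minimal illustration of the textbook Elo formula, not the author's implementation: the function names, the K-factor of 20, the 400-point scale, and the optional `home_advantage` offset are all assumptions, and the margin-of-victory adjustment the excerpt mentions is left out.

```python
def expected_score(rating_a, rating_b, home_advantage=0.0):
    """Expected result for team A (0..1) under the standard Elo logistic curve.

    home_advantage is a hypothetical rating bonus added to team A when it
    plays at home; 400 is the conventional Elo scale constant.
    """
    diff = rating_b - (rating_a + home_advantage)
    return 1.0 / (1.0 + 10 ** (diff / 400.0))


def update_elo(rating_home, rating_away, home_won, k=20.0, home_advantage=0.0):
    """Return the (home, away) ratings after one game.

    The winner gains k * (actual - expected) points and the loser gives up
    the same amount, so an upset moves ratings more than an expected win.
    """
    exp_home = expected_score(rating_home, rating_away, home_advantage)
    actual_home = 1.0 if home_won else 0.0
    delta = k * (actual_home - exp_home)
    return rating_home + delta, rating_away - delta


# Two evenly rated teams: the winner takes half of k from the loser.
even_home, even_away = update_elo(1500, 1500, home_won=True)

# A 1600-rated favourite hosting a 1400-rated underdog.
fav_after_win, _ = update_elo(1600, 1400, home_won=True)
_, dog_after_upset = update_elo(1600, 1400, home_won=False)
```

In this sketch the favourite gains only a few points for the expected win, while the underdog gains roughly three times as many for the upset, which matches the behaviour the excerpt describes.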
Forecasting Outcomes of Major League Baseball Games Using Machine Learning
Forecasting Outcomes of Major League Baseball Games Using Machine Learning EAS 499 Senior Capstone Thesis University of Pennsylvania An undergraduate thesis submitted in partial fulfillment of the requirements for the Bachelor of Applied Science in Computer and Information Science Andrew Y. Cui1 Thesis Advisor: Dr. Shane T. Jensen2 Faculty Advisor: Dr. Lyle H. Ungar3 April 29, 2020 1 Candidate for Bachelor of Applied Science, Computer and Information Science. Email: [email protected] 2 Professor of Statistics. Email: [email protected] 3 Professor of Computer and Information Science. Email: [email protected] Abstract When two Major League Baseball (MLB) teams meet, what factors determine who will win? This apparently simple classification question, despite the gigabytes of detailed baseball data, has yet to be answered. Published public-domain work, applying models ranging from one-line Bayes classifiers to multi-layer neural nets, has been limited to classification accuracies of 55-60%. In this paper, we tackle this problem through a mix of new feature engineering and machine learning modeling. Using granular game-level data from Retrosheet and Sean Lahman’s Baseball Database, we construct detailed game-by-game covariates for individual observations that we believe explain greater variance than the aggregate numbers used by existing literature. Our final model is a regularized logistic regression elastic net, with key features being percentage differences in on-base percentage (OBP), rest days, isolated power (ISO), and average baserunners allowed (WHIP) between our home and away teams. Trained on the 2001-2015 seasons and tested on over 9,700 games during the 2016-2019 seasons, our model achieves a classification accuracy of 61.77% and AUC of 0.6706, exceeding or matching the strongest findings in the existing literature, and stronger results peaking at 64-65% accuracy when evaluated monthly. -
Forecasting Batting Averages in MLB
Forecasting Batting Averages in MLB by Sarah Reid Bailey B.Sc. University of the Pacific, 2015 Project Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in the Department of Statistics and Actuarial Science Faculty of Science © Sarah Reid Bailey 2017 SIMON FRASER UNIVERSITY Fall 2017 Copyright in this work rests with the author. Please ensure that any reproduction or re-use is done in accordance with the relevant national copyright legislation. Approval Name: Sarah Reid Bailey Degree: Master of Science (Statistics) Title: Forecasting Batting Averages in MLB Examining Committee: Chair: Jinko Graham Professor Tim Swartz Senior Supervisor Professor Simon Fraser University Jason Loeppky Supervisor Associate Professor UBC Okanagan Peter Chow-White Internal Examiner Associate Professor Date Defended: November 14, 2017 Abstract We consider new baseball data from Statcast which includes launch angle, launch velocity, and hit distance for batted balls in Major League Baseball during the 2015 and 2016 seasons. Using logistic regression, we train two models on 2015 data to get the probability that a player will get a hit on each of their 2015 at-bats. For each player we sum these predictions and divide by their total at bats to predict their 2016 batting average. We then use linear regression, which expresses 2016 actual batting averages as a linear combination of 2016 Statcast predictions and 2016 PECOTA predictions. When using this procedure to obtain 2017 predictions, we find that the combined prediction performs better than PECOTA. This information may be used to make better predictions of batting averages for future seasons. -
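The two-stage procedure in the abstract above (per-at-bat hit probabilities summed into a batting average, then a linear blend with PECOTA) can be illustrated roughly as follows. Everything concrete here is an assumption made for illustration: the function names, the two-feature logistic model, the coefficient values, and the convention that at-bats without a batted ball contribute zero hit probability. None of it is taken from the thesis.

```python
import math


def hit_probability(launch_angle, launch_velocity, coef):
    """Logistic model for P(hit) given two Statcast features.

    coef = (intercept, angle_weight, velocity_weight) is a hypothetical
    fitted coefficient triple.
    """
    intercept, w_angle, w_velocity = coef
    z = intercept + w_angle * launch_angle + w_velocity * launch_velocity
    return 1.0 / (1.0 + math.exp(-z))


def predicted_batting_average(batted_balls, at_bats, coef):
    """Sum per-ball hit probabilities and divide by total at-bats.

    batted_balls is a list of (launch_angle, launch_velocity) pairs.
    Assumption: at-bats with no batted ball (e.g. strikeouts) add zero.
    """
    total = sum(hit_probability(a, v, coef) for a, v in batted_balls)
    return total / at_bats


def blended_prediction(statcast_avg, pecota_avg, weights):
    """Linear combination of the two forecasts.

    weights = (intercept, statcast_weight, pecota_weight); in practice
    these would come from a least-squares fit, here they are placeholders.
    """
    b0, b_statcast, b_pecota = weights
    return b0 + b_statcast * statcast_avg + b_pecota * pecota_avg


# With all-zero coefficients every batted ball is a coin flip, so two
# batted balls across four at-bats give a predicted average of 0.25.
avg = predicted_batting_average([(12.0, 95.0), (30.0, 88.0)], 4, (0.0, 0.0, 0.0))
blend = blended_prediction(avg, 0.270, (0.0, 0.5, 0.5))
```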
Projecting the Future Individual Contributions of NHL Players
PROJECTING THE FUTURE INDIVIDUAL CONTRIBUTIONS OF NHL PLAYERS A Thesis by AARON THOMAS KNODELL Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Chair of Committee, James Caverlee Committee Members, Zhangyang Wang Irina Gaynanova Head of Department, Dilma Da Silva May 2019 Major Subject: Computer Engineering Copyright 2019 Aaron Thomas Knodell ABSTRACT Professional sports are a multibillion-dollar industry with millions of people invested in the outcomes of games and seasons. Owners, management, and fans sit on the edges of their seats wondering what will happen next. Lots of work has been done forecasting success at the team level across a variety of sports, but player level predictions are less common. Predictive work related to the NHL is even rarer. This thesis explores the ability to predict NHL player performance in a given season using publicly available information via statistical learning methods. Data featured in the analysis includes play-by-play and shift information, box score statistics, a variety of composite and catch-all statistics, injury information, and player biographical information. Data was compiled and analyzed to find meaningful relationships between past and future performance. The results of the analysis found the most predictive values in the given data set and leveraged them to build a predictive model. The findings suggest that a previous season’s raw numbers can be supplemented with more information to improve predictive power. ii CONTRIBUTORS AND FUNDING SOURCES Contributors This work was supervised by a thesis committee consisting of Professor James Caverlee and Zhangyang Wang of the Department of Computer Science and Engineering and Professor Irina Gaynanova of the Department of Statistics. -
Simulation-Based Projections for Baseball Statistics
Simulation-Based Projections for Baseball Statistics A Thesis Presented to the Faculty of California State Polytechnic University, Pomona In Partial Fulfillment Of the Requirements for the Degree Master of Science In Computer Science By Daniel Adam Acevedo 2018 SIGNATURE PAGE THESIS: SIMULATION-BASED PROJECTIONS FOR BASEBALL STATISTICS AUTHOR: DANIEL ADAM ACEVEDO TERM SUBMITTED: Spring 2018 Computer Science Department Dr. Yu Sun, Thesis Committee Chair, Department of Computer Science Dr. Abdelfattah Amamra, Department of Computer Science Dr. Sampath Jayarathna, Department of Computer Science ACKNOWLEDGEMENTS I would like to thank my family for their love, support, and for all of the sacrifices they’ve made so that I could get an education, resulting in a Master’s Degree. I would like to thank my advisor, Dr. Yu Sun, for his guidance and help throughout my time conducting this thesis. I would also like to thank Dr. Sampath Jayarathna and Dr. Abdelfattah Amamra for being members of my committee. ABSTRACT Baseball is an unpredictable sport. The introduction of sabermetrics established an opening for the application of computer science methods within the game’s evaluation. Every Major League Baseball organization has developed their own method of measuring players’ results and making predictions as to what they should expect from a player entering a season. While most industry models use their own statistical analysis to perform predictions, this thesis introduces a new model that uses simulations in addition to statistical analysis in order to make predictions. The results of this thesis show that this model is comparable to some of the best projection systems available. -
A Data-Driven Method for In-Game Decision Making in MLB
A Data-driven Method for In-game Decision Making in MLB Gartheeban Ganeshapillai, John Guttag Massachusetts Institute of Technology Cambridge, MA, USA 02139 Email: [email protected] Abstract In this paper we show how machine learning can be applied to generate a model that could lead to better on-field decisions by predicting a pitcher’s performance in the next inning. Specifically we show how to use regularized linear regression to learn pitcher-specific predictive models that can be used to estimate whether a starting pitcher will surrender a run if allowed to start the next inning. For each season we trained on the first 80% of the games, and tested on the rest. The results suggest that using our model would frequently lead to different decisions late in games than those made by major league managers. There is no way to evaluate what would have happened when a manager lifted a pitcher that our model would have allowed to continue. From the 5th inning on in close games, for those games in which a manager left a pitcher in that our model would have removed, the pitcher ended up surrendering at least one run in that inning 60% (compared to 43% overall) of the time. 1 Introduction Perhaps the most important in-game decision a baseball manager makes is when to relieve a starting pitcher [2]. Today, managers rely on various heuristics (e.g., pitch count) to decide when a starting pitcher should be relieved [2, 17]. In this paper, we use machine learning to derive a model that can be used to assist in making these decisions in a principled way. -
Summaries of Research Using Retrosheet Data
Acharya, Robit A., Alexander J. Ahmed, Alexander N. D’Amour, Haibo Lu, Carl N. Morris, Bradley D. Oglevee, Andrew W. Peterson, and Robert N. Swift (2008). Improving major league baseball park factor estimates. Journal of Quantitative Analysis in Sports, Vol. 4 Issue 2, Article 4. There are at least two obvious problems with the original Pete Palmer method for determining ballpark factor: assumption of a balanced schedule and the sample size issue (one year is too short for a stable estimate, many years usually means new ballparks and changes in the standing of any specific ballpark relative to the others). A group of researchers including Carl Morris (Acharya et al., 2008) discerned another problem with that formula: inflationary bias. I use their example to illustrate: Assume a two-team league in which Team A’s ballpark “really” has a factor of 2 and Team B’s park a “real” factor of .5. That means four times as many runs should be scored in the first as in the second. Now we assume that this holds true, and that in a two-game series at each park each team scores a total of eight runs at A’s home and two runs at B’s. If you plug these numbers into the basic formula, you get
1 – (8 + 8) / 2 = 8 for A; (2 + 2) / 2 = 2 for B
2 – (2 + 2) / 2 = 2 for A; (8 + 8) / 2 = 8 for B
3 – 8 / 2 = 4 for A; 2 / 8 = .25 for B
figures that are twice what they should be. The authors proposed simultaneously solving a series of equations controlling for team offense and defense, with the result representing the number of runs above or below league average the home park would give up during a given season. -
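The arithmetic in the two-team example above can be reproduced with a short sketch. The function name and argument layout are hypothetical, and the "basic formula" is taken here to mean runs per game at home divided by runs per game on the road, which is what the three numbered steps in the example compute.

```python
def basic_park_factor(home_runs_total, home_games, road_runs_total, road_games):
    """Ratio of runs per game in a team's home park to runs per game on the road.

    Run totals count both teams' scoring in those games, as in the example.
    """
    home_rate = home_runs_total / home_games   # step 1 in the example
    road_rate = road_runs_total / road_games   # step 2 in the example
    return home_rate / road_rate               # step 3 in the example


# Two-game series at each park: 8 + 8 runs at A's park, 2 + 2 runs at B's park.
factor_a = basic_park_factor(16, 2, 4, 2)   # A's home games vs. A's road games
factor_b = basic_park_factor(4, 2, 16, 2)   # B's home games vs. B's road games
```

With these inputs the sketch gives 4 for A and 0.25 for B, against the assumed "real" factors of 2 and .5. In a two-team league each team's road games are all played in the other extreme park, so the basic ratio overstates how far each park sits from neutral.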
SEAM Methodology for Context-Rich Player Matchup Evaluations in Baseball
SEAM methodology for context-rich player matchup evaluations in baseball Abstract We develop SEAM (synthetic estimated average matchup) methodology which can be used to evaluate batter versus pitcher matchups, both numerically and visually. We estimate the distribution of balls put into play by a batter facing a pitcher, called the spray chart distribution. This distribution is conditional on individual batter and pitcher characteristics, which are a better expression of talent than conventional statistics. Many individual matchups have a sample size that is too small to be reliable. Synthetic versions of the batter and pitcher under consideration are constructed in order to alleviate these concerns. Weights governing how much influence these synthetic players have on the overall spray chart distribution are constructed to minimize expected mean square error. We provide novel performance metrics that are calculated as expectations taken with respect to the spray chart distribution. These performance metrics provide a context rich approach to player evaluation. We also provide a Shiny app that allows users to visualize and evaluate any batter-pitcher matchup that has occurred or could have occurred in the last five years. One can access this app at https://seam.shinyapps.io/seam/. Our methodology and interactive tool have utility for anyone interested in baseball. Keywords: Nonparametric density estimation; Similarity scores; Model averaging; Reproducible research; Sabermetrics; Big data applications and visualization 1 Introduction Baseball has a rich statistical history dating back to the first box score created by Henry Chadwick in 1859. Fans, journalists, and teams have obsessed over baseball statistics and performance metrics ever since.