An In-Depth Analysis of Play for the 2019 NFL Regular Season

Demetrios Papakostas

Arizona State University School of Mathematical and Statistical Sciences

November 24, 2020

Abstract

In this paper, we investigate data pertaining to quarterback performance for the 2019 NFL season. Our paper is divided into two main parts. In Part 1, we evaluate traditional counting statistics and compare players based on these statistics. In Part 2, we take a causal perspective on quarterback performance. We ask how much a certain quarterback causes their own success, and how much of it is due to the situation they are placed in. To do so, we predict how a quarterback should have performed in the conditions they were in by performing a regression analysis on a covariate set of exogenous variables. We conclude that some players' gaudy statistics may be in part influenced by their situation, while other quarterbacks may in fact be held back by the system they operate in.

1 Introduction

Evaluation of quarterback performance is not entirely new. In fact, many statistics and grades have been created as evaluation tools. These include, for example, traditional quarterback passer rating and newer, more advanced metrics such as ESPN's Total QBR [Katz and Burke, 2017]. Of course, all of these have been subject to much criticism, and there has been recent work to create more easily accessible and reproducible statistics, such as nflWAR [Yurko et al., 2019]. While the criticism is certainly valid and alternative statistics may be preferable, we present a different interpretation of these statistics. We assume, for example, that quarterback passer rating is a good measure of overall contribution to a winning football team; see Byrne [2011]. So in terms of overall relevance to team success, we rely on the assumption that passer rating (and QBR) are at least somewhat relevant to team success. Given all this, the goal of this paper is not to question the validity of any of these metrics. Rather, we present an analysis from a causal perspective. We build on a growing literature of using Bayesian methodology in sports analytics; see Santos-Fernandez et al. [2019] for a comprehensive review.

1.1 Causal Analysis

Our causal analysis is based on counterfactual thinking. Specifically, we consider the potential outcome framework of Imbens and Rubin [2006]. We define an outcome (passer rating or QBR score) for quarterback $i$ as $D_i$; in this setting our outcome is continuous. $E_i$ is the exposure for quarterback $i$. This takes on a somewhat different meaning than usual in the causal inference literature: the exposure (or treatment) is the specific quarterback themselves being inserted into the team/system they are in. This is not strictly a treatment per se, but it allows us to ask what the benefits of having said quarterback on the specific team are.

We now introduce the concept of a potential outcome, originally formulated by Rubin [1974]. In terms of potential outcomes, we are interested in two scenarios: $D_i^1$ and $D_i^0$, which are respectively the outcome of a certain team's quarterback output if they had their current quarterback versus how we would expect an average quarterback to perform in the system. However, we only ever get to observe one of these situations, not both; this is referred to as the "fundamental problem of causal inference" [Holland, 1986]. Specifically, here we observe $D_i^1$. We got to watch Patrick Mahomes in Kansas City; we did not get to see how Derek Carr (a proxy for an "average" NFL QB) would perform in that system. In particular, unmeasured confounding between exposure and outcome may bias the estimate of the causal effect, and the naive estimate $E(D_i \mid E_i = 1) - E(D_i \mid E_i = 0)$ is often invalid, because we do not control for unobserved confounding [Pearl, 2000]. Figure 1 demonstrates the issue at hand.

[Figure 1: causal graph with $U$ (confounding) pointing to both $E$ (quarterback's talent) and $D$ (measured performance), and $E$ pointing to $D$.]

Figure 1: The causal graph illustrates the main issue at play. There is unobserved information that affects both the treatment and the outcome, making direct causal estimates inaccurate. By controlling for the observed covariates, which are available to the researcher, we essentially "delete" the arrows from $U$ to $E$ and $D$, an illustration of the ignorability assumption.

Our major assumption in our study is that we have sufficient data to control for confounding, i.e. the ignorability assumption that $E \perp \{D^1, D^0\} \mid U$. That is, we posit that the data we have in $x$, described in further detail in Section 2, is enough to describe the confounding that affects both a quarterback's placement on a team and their production. For example, while Daniel Jones is considered very talented, he was stuck behind a struggling offensive line in 2019. Therefore, part of his struggles could be attributed to this struggling offensive line. Then again, because of Daniel Jones's mobility, his team may feel less need to invest in the offensive line, so offensive line strength varying between teams confounds using one metric as a sole evaluator of production.

Of course, this assumption is imperfect. For example, we do not control for factors such as weather, fans, and officiating that could have a major impact on a player's productivity and are out of their control. However, our data set is fairly thorough and takes into account factors that most former players, coaches, and analysts pinpoint as determinants of a team's and quarterback's success.

Our solution, while certainly flawed and burdened with heavy assumptions, is to estimate how a quarterback plopped onto a team would perform based solely on factors that the quarterback cannot control. To do so, we use a modern non-linear machine learning regression tool to predict outcomes for a quarterback given certain covariates, where the weights of each covariate are trained on the covariates of every other quarterback in the data set. We interpret this as an estimate of $D^0$, the potential outcome corresponding to the expected performance of a league-average quarterback placed in a certain team's system. That is, given nothing but data about a team (irrespective of the quarterback), and given the data on all the other quarterbacks in the league used to weight the variables during the regression stage of modeling, we predict how well a quarterback should perform in the system. We then compare this output to what we observed, as a fairer estimator of production.

2 The Data

2.1 Comparison Data

In this section, we compare the box-score data of quarterbacks. The data are sourced from espn.com and are all from 2019. We exclude players who did not have sufficient playing time. In total, we have 46 players in our data set. The covariates we investigate are listed below. We omit physical attributes and intangibles such as age and combine measurements, as our main goal is to evaluate similarities in performance, not similarities in physical traits, which are more immediately obvious.

1. ATT = Attempts per game

2. CMPperc = Completion percentage


3. AVG = Average Yards/Attempt

4. YDSG = Yards per game

5. TD = Touchdowns per game

6. INT = Interceptions per game

7. SACK = Sacks per game

8. SYL = Sack yards lost per game

9. QBR = ESPN quarterback rating (supposedly an all-encompassing stat; it is somewhat qualitative, so perhaps a good metric for performance that is not directly dependent on the other counting stats)

10. rushydspergame = Quarterback rush yards per game

2.2 About the Quarterback Data Set

The data are sourced from espn.com, which tabulates data from the box scores of every game. All the data are from 2019. Two datasets were compiled. The first contains traditional box-score stats and attributes of quarterbacks, used for the exploratory data analysis in Section 3.1. The second data set is more pertinent to the causal question at hand. It includes data that accounts for the "confounding" from Figure 1, at least under the ignorability assumption.

Some of the ratings in this dataset are somewhat subjective, such as coachrate, oline, and wrhelp. These were determined by the author. One caveat: to compensate for the relative subjectivity in these ratings, it could be argued that SACK per game and SYL (sack yards lost) per game should be included. While it is certainly arguable that these statistics reflect more on a quarterback than on their system, and thus should not be included in the counterfactual, we could also argue that including these additional covariates lends further credence to the more subjective ratings for offensive line and wide receiver help, and provides another gauge of the type of offensive system the quarterback is asked to operate in. We choose not to include these covariates in our leave-one-out cross-validation analysis. However, if there is concern about these covariates, the accompanying R-shiny app allows the user to choose which covariate, or subset of covariates, to include in the study; in that setting, we include the sack data as additional covariates. Additionally, the user is allowed to filter which observations to use, as well as to edit all the covariates. For example, if one disagrees with the rating of the coach for a certain player, they can edit the rating before running the modeling step, as sketched below.
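As a rough illustration of this covariate-toggling mechanism, the snippet below shows how it could be wired up in shiny. This is a minimal sketch, not the actual app code; the data frame `qb` and its column names are assumptions.

```r
# Minimal sketch of covariate toggling, assuming a data frame `qb` with
# one row per quarterback; names here are hypothetical, not the app's code.
library(shiny)

covs <- c("coachyrs", "coachrate", "wrhelp", "oline",
          "teamrpg", "teampagame", "sos", "SACK", "SYL")

ui <- fluidPage(
  checkboxGroupInput("use", "Covariates to include:",
                     choices = covs,
                     selected = setdiff(covs, c("SACK", "SYL"))),
  tableOutput("preview")
)

server <- function(input, output) {
  # Show the covariate set that would be passed on to the modeling step
  output$preview <- renderTable({
    qb[, c("player", input$use), drop = FALSE]
  })
}

shinyApp(ui, server)
```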

We do not include any statistics that are used to calculate the response statistics themselves. This would be misleading, as the BART model would likely (and correctly) predict the response score very well, which defeats the purpose of trying to create a counterfactual. For this reason, we exclude these covariates.

Another issue is that some of these covariates may in fact be influenced by the quarterback. That is, we may be violating the assumption on the direction of the arrow in Figure 1. For example, Tampa Bay had a fairly strong defense, but Jameis Winston gave his defense bad field position with record turnovers. A better-adjusted statistic could potentially remedy this.

1. coachyrs = how many years the head coach has been coaching

2. coachrate = coach rating, categorical A-F scale

3. wrhelp = how good a player's pass catchers are, also on a categorical A-F scale

4. oline = how good a player's offensive line is at pass protection, also on a categorical A-F scale

5. teamrpg = how many yards per game a player's team averages from rushing the ball

6. teampagame = how many points per game a player's defense gives up

7. sos = strength of schedule, the win percentage of a player's team's opponents

The response variables we choose to investigate are QBR and passer rating. The two statistics are relatively similar overall, as indicated by Figure 2. This is somewhat surprising, given the more subjective nature of the QBR tabulation. However, some players who struggle in passer rating perform better in QBR, and vice versa.


Figure 2: Comparing passer rating vs. QBR. Left: a kernel density estimate (KDE) for both the marginals and the joint distribution. Right: a direct comparison of the two, fit with a linear regression and a 95% CI. Correlation of 0.835.


3 Methodology

3.1 Early Exploration Stage: Comparison of Players

In this section we use the data tabulated in Section 2.1 to compare different players, exploring how quarterbacks are related. We first deploy a k-means clustering analysis. Using the algorithm of Hartigan and Wong [1979], we investigate how similar quarterbacks are in general based on their box-score stats. The goal of k-means clustering is to cluster our $n$ quarterbacks into $K$ clusters. That is, our set of quarterbacks is $Q = \{q_i\}, i = 1,\dots,n$ and our clusters are $C = \{c_k\}, k = 1,\dots,K$, where $K$ is defined by the researcher. If we let $\mu_k$ be the mean of cluster $c_k$, the goal of k-means is to minimize the sum of the squared error over all $K$ clusters, i.e.

$$\min \sum_{k=1}^{K} \sum_{q_i \in c_k} \left\| q_i - \mu_k \right\|^2$$

However, there are some issues. In general, "the discrete clustering problem", into which the k-means method falls, is NP-hard even for $k = 2$, the smallest number of clusters, and thus can be a computational hurdle¹ [Drineas et al., 1999]. Similarly, the parameters of k-means can be problematic. The three parameters are how to initialize the clusters, how many clusters to have, and which distance metric to minimize over. Tibshirani et al. [2001] offer some tools and tricks for choosing these parameters appropriately. While useful, choosing parameters remains an issue when employing a k-means clustering analysis. Jain [2010] provides further detail on potential issues one may encounter when employing a k-means analysis.
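For concreteness, the following is a minimal sketch of this clustering step in R, assuming the Section 2.1 covariates sit in a data frame `qb` with one row per quarterback; the object and column names are illustrative, not the paper's actual code.

```r
# Minimal k-means sketch on the box-score covariates of Section 2.1.
stats <- c("ATT", "CMPperc", "AVG", "YDSG", "TD",
           "INT", "SACK", "SYL", "QBR", "rushydspergame")
x <- scale(qb[, stats])   # standardize so no single stat dominates the distance

set.seed(2019)            # results depend on the random initialization
km <- kmeans(x, centers = 2, nstart = 25)   # K = 2, with 25 random restarts

split(qb$player, km$cluster)   # which quarterbacks fall in each cluster
```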

We also present an affinity propagation technique to compare the quarterbacks to one another. Affinity propagation is also a clustering algorithm, like k-means, but differs in many ways. For one, affinity propagation is built on the idea that data can be better characterized by measures of pairwise similarities than by minimizing some Euclidean distance metric [Dueck, 2009]. In this sense, clusters are identified by "exemplar" data points, not domain parameters. Additionally, this method does not require pre-determining the number of clusters, a common drawback of k-means clustering [Jain, 2010]. Affinity propagation offers flexibility over other clustering methods by allowing the algorithm to be agnostic about how similarities between data points are computed. In this sense, affinity propagation is similar to k-medoid algorithms, but considers all data points as potential exemplars. Dueck [2009] shows affinity propagation to perform better on certain computer vision classification tasks.
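A hedged sketch of this step, assuming the apcluster R package (the paper does not name its implementation) and the standardized matrix `x` from the k-means sketch above:

```r
# Affinity propagation sketch using the apcluster package (an assumption).
library(apcluster)

# Negative squared Euclidean distance as the pairwise similarity measure
sim <- negDistMat(x, r = 2)

# apcluster chooses the exemplars itself; no cluster count is pre-specified
ap <- apcluster(sim)

ap@exemplars        # the "exemplar" quarterbacks
heatmap(ap, sim)    # pairwise-similarity heatmap, as in Figure 6
```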

¹ Not an issue in this study due to the small dataset.


3.2 Causal Section

To determine the effect of inserting a "replacement" quarterback², we must have a flexible regression model to determine how much performance is impacted by non-quarterback factors. A first response might be to use modern non-parametric regression techniques that can flexibly model the effect of additional covariates and their interactions. One such model is BART, which stands for Bayesian additive regression trees, introduced in Chipman and McCulloch [2010]. BART has been used in sports analytics before: McCulloch and Abrevaya [2014] used BART to analyze whether, in the NHL, a penalty called on one team makes it more likely that the next penalty will be called on the opposing team.

3.2.1 BART: A brief overview

BART, Bayesian additive regression trees, is at its core a sum-of-trees model. We begin with a $p$-dimensional vector of covariates $x$ and a continuous response variable $D$. BART can be expressed as

$$D = f(x) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2) \qquad (1)$$

where $f(x) = E(D \mid x)$ is approximated by a sum of $L$ regression trees, i.e. $f(x) \approx h(x) = \sum_{l=1}^{L} q_l(x)$. See Figure 3 for an example regression tree. The tree consists of splitting rules (e.g. $x_1 < 0.8$ in Figure 3) and terminal nodes/leaves (e.g. $m_{l1}$ in Figure 3). BART expands on the sum-of-trees model by promoting trees that are small, with the terminal node parameters shrunk toward zero. This is done by setting the prior probability that a non-terminal node at depth $d$ splits to favor small trees, with the probability specified as $\alpha(1+d)^{-\beta}$, $\alpha \in (0,1)$, $\beta \in [0,\infty)$, first described in [Chipman et al., 1998]. Specifically, the default prior from Chipman and McCulloch [2010] sets $\alpha = 0.95$ and $\beta = 2$, which yields prior probabilities of 0.05, 0.55, 0.28, 0.09, and 0.03 for trees with 1, 2, 3, 4, and 5 or more terminal nodes, respectively [Hill et al., 2019]. The default implementation of BART [Chipman and McCulloch, 2010] chooses which variables to split on, and where to split, uniformly at random.

Additionally, the terminal node parameters are assigned priors, $m_{lb} \sim N(0, \sigma_\mu^2)$, where $\sigma_\mu = 0.5/(k\sqrt{L})$, $L$ is the number of trees, and $k$ is set to $k = 2$ in Chipman and McCulloch [2010], which implies a 95% prior probability that $E(D \mid x)$ is between $-0.5$ and $0.5$. Thus, this prior shrinks the $m_{lb}$ parameters toward zero, making individual tree effects small. For the $\sigma^2$ prior from equation (1), BART assigns a scaled inverse chi-square distribution, $\sigma^2 \sim \nu\lambda/\chi^2_\nu$, where the hyper-parameters are also guided by the observed data. BART uses an MCMC algorithm to compute the posterior given the prior parameters, described in Chipman and McCulloch [2010], and also in Hill et al. [2019] and Tan and Roy [2019].
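As a concrete illustration, the sketch below fits a BART model in R under these default prior settings, assuming the dbarts package and objects `z` (the exogenous covariates of Section 2.2) and `y` (the response metric); none of these names come from the paper.

```r
# Minimal BART sketch with the Chipman and McCulloch [2010] defaults:
# 200 trees, splitting prior alpha = 0.95 / beta = 2, and k = 2.
library(dbarts)

fit <- bart(x.train = z, y.train = y,
            ntree = 200,        # L, the number of trees in the sum
            k     = 2,          # controls shrinkage of the leaf parameters
            verbose = FALSE)

# Each row of yhat.train is one posterior MCMC draw of f(x); we summarize
# with the posterior mean, as done later in the paper.
f_hat <- colMeans(fit$yhat.train)
```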

² The replacement is determined by assigning covariate weights by training our BART model on the rest of the quarterbacks in the league.



Figure 3: (Left) An example binary tree, with internal nodes labelled by their splitting rules and terminal nodes labelled with the corresponding parameters $m_{lb}$. (Right) The corresponding partition of the sample space and the step function. Figure from [Hahn et al., 2020].

3.2.2 Machine Learning Causal Inference: Where we stand

Hill [2011] suggests using BART for causal inference by including the binary treatment indicator as a covariate in the estimating function $f(\cdot)$. The Hill paper highlights the BART algorithm's flexibility in settings with linear and non-linear data generation, as well as its ability to study heterogeneous treatment effects³. Specifically, Hill [2011] is interested in the case where the researcher has access to a rich enough set of covariates to control for confounding⁴. In this setting, the bivariate probit's model specifications greatly limit its practicality, whereas a flexible model like BART will likely shine. In fact, Hill [2011] argues that under the ignorability assumption,

$$E\left(D_i^1 \mid X = x_i\right) = E\left(D_i \mid E_i = 1, X = x_i\right) = f^*(1, x_i)$$
$$E\left(D_i^0 \mid X = x_i\right) = E\left(D_i \mid E_i = 0, X = x_i\right) = f(0, x_i)$$

where $f(0, x)$ is the BART estimate of $f$ for the case of no treatment. That is, it is our estimate of how a player given the situation of a certain team should perform. $f^*$ indicates what we in fact observe, as we assume the treatment is for a team's system to have a certain quarterback with their specific skill set and deficiencies. What we estimate is what would happen if a hypothetical replacement quarterback were placed in the circumstances of the player who was removed from the training data set. The hypothetical quarterback's performance is determined by training on the rest of the league. In this potential outcome framework, BART estimates of the conditional average treatment effect (CATE) are given by

$$\widehat{\text{CATE}} = \frac{1}{n}\sum_{i=1}^{n}\left[E\left(D_i^1 \mid x_i\right) - E\left(D_i^0 \mid x_i\right)\right] = \frac{1}{n}\sum_{i=1}^{n}\left[f^*(1, x_i) - f(0, x_i)\right] \qquad (2)$$

³ By heterogeneous treatment effects, we refer to different units, or strata of units, in a dataset responding differently to treatment [Athey and Imbens, 2015]. In particular, these differences can arise from interactions between different features, which are of interest to study but difficult to model.
⁴ This is the first half of the strong ignorability assumption.


Hill [2011] shows BART's performance on both simulated and real data⁵. We use BART to estimate $f(0, x_i)$. Although BART provides a full posterior of estimates, we simply take the mean of the posterior draws and use that number as our estimate of the projected metric (QBR or passer rating) predicted for a replacement quarterback given the circumstances of the missing player.

We have mentioned multiple times that we want to remove the player of interest when evaluating their impact as a quarterback. To do so, we incorporate leave-one-out cross-validation (LOOCV) for our dataset. Part of the reason is that, despite having a decent number of predictor variables ($p$), we lack a robust number of observations ($n$); fitting the model with all the data points is therefore prone to overfitting. Our LOOCV is implemented by removing one quarterback at a time, training our regression model on every other quarterback, and then predicting the removed quarterback's output given the trained model and the removed player's covariates. We repeat this for every quarterback in the sample, as sketched below.
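A minimal sketch of this loop, continuing the assumed dbarts setup from above (`z`, `y`, and the quarterback names are illustrative, not the paper's code):

```r
# Leave-one-out loop: for each quarterback i, train BART on the other
# n - 1 quarterbacks and predict i's metric from i's team covariates.
library(dbarts)

n <- nrow(z)
pred <- numeric(n)
for (i in seq_len(n)) {
  fit <- bart(x.train = z[-i, ], y.train = y[-i],
              x.test  = z[i, , drop = FALSE],  # held-out quarterback's situation
              verbose = FALSE)
  pred[i] <- mean(fit$yhat.test)   # posterior-mean "replacement" prediction
}

# Positive values: the player outperformed what their situation predicts
diff_above_expected <- y - pred
```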

In this specific example, LOOCV has obvious interpretability advantages. For example, if we remove Patrick Mahomes from our dataset, we can train our model on all the other quarterbacks who took significant snaps in 2019. Assuming our model picks up the trends well, we then have a pretty good idea of what leads to success, and we can predict how someone placed in the same situation as Mahomes would perform, without Mahomes being in the training set to influence this result. For an exceptional player like Mahomes, this methodology seems especially important, as he could influence the training data significantly, particularly with so few observations.

⁵ Hill used data from an experiment that was "turned into" an observational study by discarding a nonrandom portion of the treatment group.


4 Results

4.1 Comparison Analysis

We first begin with our k-means analysis. We choose $K = 2$, mainly as a result of trial and error; results are shown in Figure 5. Our affinity propagation analysis is presented in Figure 6. We also present the results for a specific player that help motivate our entire rationale behind the causal analysis.

In Figure 4, we look at the three most similar players to Seattle Seahawks quarterback Russell Wilson, widely considered a top player in the league by 2019, but often considered to be held back by his team and system⁶. The first two players Wilson is most similar to are probably reasonable comparisons based on style of play and talent. However, the third is Derek Carr, who is not a very similar player, but one who plays in a similarly conservative, run-first offense; thus the comparison is likely due more to extraneous circumstances outside of both players' control.

Figure 4: Looking at the 3 most similar players to Russell Wilson in the 2019 season, which are Deshaun Watson, Carson Wentz, and Derek Carr.

[Figure 5 scatter plot: quarterbacks plotted on the first two principal dimensions (Dim 1 vs. Dim 2), colored by k-means cluster (1 or 2).]

Figure 5: $K = 2$. The clusters seem to do a pretty good job of clumping most quarterbacks together, as well as identifying "2nd string" and very early career quarterbacks in cluster 1.

⁶ In this respect, if early returns are to hold, the 2020 NFL season seems to corroborate this theory.


[Figure 6 heatmap: "Affinity Propagation Distances", pairwise distances between all quarterbacks in the set.]

Figure 6: Affinity propagation distance between the quarterbacks in the set based on their traditional counting stats.

[Figure 7 panels: "Player ranks Passer Rating" (left) and "Player ranks QBR" (right); x-axis: difference above expected.]

Figure 7: Left: The difference between observed passer rating and expected passer rating. Right: Same plot, but for the QBR stat from ESPN.


[Figure 8 panels: predictedRTG vs. actualRTG (correlation = 0.73) and predictedQBR vs. actualQBR (correlation = 0.34).]

Figure 8: Top left: comparing our predicted passer rating vs. actual. Top right: predicted QBR vs. actual. Linear regressions fit to both with 95% CIs. The bottom plots are the same but with KDE estimates of the marginal and joint distributions. The rating data mostly line up, with some exceptions; the QBR predictions are more varied.


name | actual RTG | predicted RTG | diff RTG | rank RTG | actual QBR | predicted QBR | diff QBR | rank QBR
Aaron Rodgers | 95.40 | 89.83 | 5.58 | 16 | 50.80 | 52.18 | -1.38 | 27
Andy Dalton | 78.30 | 72.67 | 5.63 | 15 | 39.80 | 36.15 | 3.65 | 19
Baker Mayfield | 78.80 | 88.80 | -10.00 | 39 | 52.30 | 49.60 | 2.70 | 21
Brandon Allen | 68.30 | 86.33 | -18.03 | 44 | 38.60 | 46.29 | -7.69 | 35
Cam Newton | 71.00 | 84.98 | -13.98 | 42 | 21.00 | 47.10 | -26.10 | 45
Carson Wentz | 93.10 | 95.75 | -2.65 | 26 | 60.80 | 53.60 | 7.20 | 14
Case Keenum | 91.30 | 78.39 | 12.91 | 7 | 43.90 | 39.42 | 4.48 | 17
Dak Prescott | 99.70 | 104.03 | -4.33 | 31 | 70.60 | 56.19 | 14.41 | 7
Daniel Jones | 87.70 | 84.91 | 2.79 | 21 | 53.90 | 43.12 | 10.78 | 12
David Blough | 64.00 | 88.85 | -24.85 | 45 | 32.40 | 52.77 | -20.37 | 43
Derek Carr | 100.80 | 88.08 | 12.72 | 8 | 62.40 | 49.88 | 12.52 | 10
Deshaun Watson | 98.00 | 95.93 | 2.07 | 22 | 68.90 | 56.44 | 12.46 | 11
Devlin Hodges | 71.40 | 85.38 | -13.97 | 41 | 30.10 | 45.15 | -15.05 | 39
Drew Brees | 116.30 | 98.77 | 17.53 | 3 | 71.80 | 56.20 | 15.60 | 6
Drew Lock | 89.70 | 80.73 | 8.97 | 12 | 48.30 | 44.07 | 4.23 | 18
Dwayne Haskins Jr | 76.10 | 82.66 | -6.56 | 34 | 26.90 | 43.84 | -16.94 | 41
Eli Manning | 82.60 | 86.58 | -3.98 | 30 | 34.90 | 48.31 | -13.41 | 37
Gardner Minshew II | 91.20 | 83.41 | 7.79 | 14 | 42.90 | 42.31 | 0.59 | 24
Jacoby Brissett | 88.00 | 95.31 | -7.31 | 35 | 50.50 | 54.68 | -4.18 | 30
Jameis Winston | 84.30 | 87.70 | -3.40 | 28 | 53.90 | 51.53 | 2.37 | 22
Jared Goff | 86.50 | 91.22 | -4.72 | 32 | 48.30 | 55.85 | -7.55 | 34
Jeff Driskel | 75.30 | 86.69 | -11.39 | 40 | 47.70 | 50.02 | -2.32 | 28
Jimmy Garoppolo | 102.00 | 103.43 | -1.43 | 25 | 58.80 | 63.33 | -4.53 | 31
Joe Flacco | 85.10 | 81.71 | 3.39 | 19 | 49.00 | 44.11 | 4.89 | 16
Josh Allen | 85.30 | 89.23 | -3.93 | 29 | 47.50 | 48.67 | -1.17 | 26
Josh Rosen | 52.00 | 82.49 | -30.49 | 46 | 18.30 | 47.28 | -28.98 | 46
Kirk Cousins | 107.40 | 93.30 | 14.10 | 6 | 58.70 | 55.22 | 3.48 | 20
Kyle Allen | 80.00 | 81.21 | -1.21 | 24 | 36.70 | 42.43 | -5.73 | 32
Kyler Murray | 87.40 | 77.91 | 9.49 | 10 | 55.50 | 41.48 | 14.02 | 8
Lamar Jackson | 113.30 | 103.96 | 9.34 | 11 | 81.70 | 58.79 | 22.91 | 3
Marcus Mariota | 92.30 | 101.70 | -9.40 | 38 | 33.60 | 59.56 | -25.96 | 44
Mason Rudolph | 82.00 | 90.81 | -8.81 | 37 | 34.10 | 53.45 | -19.35 | 42
Matt Moore | 100.90 | 97.71 | 3.19 | 20 | 56.40 | 60.26 | -3.86 | 29
Matt Ryan | 92.10 | 80.93 | 11.17 | 9 | 58.10 | 44.46 | 13.64 | 9
Matthew Stafford | 106.00 | 79.75 | 26.25 | 1 | 70.20 | 46.06 | 24.14 | 2
Mitchell Trubisky | 83.00 | 86.24 | -3.24 | 27 | 40.00 | 46.92 | -6.92 | 33
Nick Foles | 84.60 | 85.35 | -0.75 | 23 | 34.10 | 44.50 | -10.40 | 36
Patrick Mahomes | 105.30 | 96.42 | 8.88 | 13 | 76.40 | 55.39 | 21.01 | 4
Philip Rivers | 88.50 | 84.03 | 4.47 | 17 | 48.90 | 43.74 | 5.16 | 15
Russell Wilson | 106.30 | 90.82 | 15.48 | 5 | 69.80 | 49.21 | 20.59 | 5
Ryan Finley | 62.10 | 77.99 | -15.89 | 43 | 23.40 | 40.31 | -16.91 | 40
Ryan Fitzpatrick | 85.50 | 69.32 | 16.18 | 4 | 66.60 | 31.37 | 35.23 | 1
Ryan Tannehill | 117.50 | 95.12 | 22.38 | 2 | 62.50 | 53.05 | 9.45 | 13
Sam Darnold | 84.30 | 80.63 | 3.67 | 18 | 43.50 | 41.32 | 2.18 | 23
Teddy Bridgewater | 99.10 | 104.88 | -5.78 | 33 | 48.60 | 62.68 | -14.08 | 38
Tom Brady | 88.00 | 96.80 | -8.80 | 36 | 54.00 | 53.73 | 0.27 | 25

Table 1: Ranking all the quarterbacks based on their actual passer rating minus their expected passer rating (RTG), as well as for QBR, the statistic tabulated by ESPN. Strength of schedule proves to matter quite a lot as a variable. For example, Ryan Tannehill and Marcus Mariota played on the same team, but Tannehill played a much more difficult schedule.


5 Discussion

Our causal analysis seems to corroborate what many think, specifically with regard to the QBR statistic, though there are some outliers (such as Case Keenum). The accompanying R-shiny app includes this dataset and allows the user to edit the covariate set, filter out players, use different models, and change modeling parameters. The app also allows users to upload their own data, for example to apply this type of analysis to a different year.

We do not claim our methodology says who is the best quarterback in football, but rather who performed best under the circumstances they were placed in. For example, Ryan Fitzpatrick was not the best quarterback in 2019, but he performed very well given the situation he was placed in. This makes sense, as Fitzpatrick is very experienced and is often considered a system-agnostic quarterback: he can be successful on many teams, but he has never proven able to take a good team and make it great. Therefore, he will probably be roughly the same player regardless of the team/situation he is in, and his strong showing here is more the result of being in a tough situation and playing pretty well. Similarly, a player like Patrick Mahomes is undeniably a top-tier player. However, many quarterbacks could be successful in the situation he is in, which our model in a sense docks Mahomes for. At the upper ends of QBR and RTG, though, any improvement becomes increasingly difficult.

References

S. Athey and G.W. Imbens. Recursive partitioning for heterogeneous causal effects. arXiv preprint, 2015. URL https://arxiv.org/abs/1504.01132v3.

K. Byrne. In defense of passer rating. Sports Illustrated, August 3, 2011. URL https://www.si.com/more-sports/2011/08/03/defending-qb-rating.

H. Chipman and R. McCulloch. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010.

H. Chipman, E. George, and R. McCulloch. Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948, 1998.

P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering large graphs via the singular value decomposition. Machine Learning, 56(1):9–33, 1999.

D. Dueck. Affinity propagation: clustering data by passing messages. PhD thesis, University of Toronto, 2009.

P.R. Hahn, J.S. Murray, and C. Carvalho. Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects. Bayesian Analysis, pages 1–64, 2020. URL https://arxiv.org/abs/1706.09523.

J.A. Hartigan and M.A. Wong. Algorithm AS 136: A k-means clustering algorithm. Applied Statistics, 28:100–108, 1979. doi: https://doi.org/10.2307/2346830.

J. Hill, A. Linero, and J. Murray. Bayesian additive regression trees: A review and look forward. Annual Review of Statistics and Its Application, 2019.

J. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.

P. Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960, 1986.

G. Imbens and D.B. Rubin. Causal Inference in Statistics, and in the Social and Biomedical Sciences. Cambridge University Press, New York, 2006.

A.K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31:651–666, 2010.

S. Katz and B. Burke. How is Total QBR calculated? We explain our quarterback rating. ESPN, 2017. URL https://www.espn.com/blog/statsinfo/post/_/id/123701/how-is-total-qbr-calculated-we-explain-our-quarterback-rating.

R. McCulloch and J. Abrevaya. Reversal of fortune: a statistical analysis of penalty calls in the National Hockey League. Journal of Quantitative Analysis in Sports, 10(2):207–224, 2014.

J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, England, 2000.

D.B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701, 1974.

E. Santos-Fernandez, P. Wu, and K.L. Mengersen. Bayesian statistics meets sports: a comprehensive review, 2019.

Y.V. Tan and J. Roy. Bayesian additive regression trees and the general BART model. arXiv preprint, 2019. URL https://arxiv.org/pdf/1901.07504.pdf.

R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B, 63(2):411–423, 2001.

R. Yurko, S. Ventura, and M. Horowitz. nflWAR: a reproducible method for offensive player evaluation in football. Journal of Quantitative Analysis in Sports, 15(3):163–183, 2019. URL https://doi.org/10.1515/jqas-2018-0010.

Appendices

Below is an example of a single regression tree fit to data from the "causal section" of the paper. While BART does not simply fit one regression tree, it is useful to fit a single tree to see roughly which exogenous variables are predictive of quarterback performance; a similar idea was used in McCulloch and Abrevaya [2014]. This tree is fit with all the data, not in a LOOCV exercise, and is meant only as an intuitive, visual aid for how a tree would predict performance. A sketch of this fit follows.
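Assuming the rpart package and the same illustrative data frame `qb` used in the earlier sketches, a single tree like the one in Figure 9 could be fit as follows:

```r
# Single-tree visual aid (not BART): one regression tree fit on all 46
# quarterbacks, using the exogenous covariates of Section 2.2.
library(rpart)
library(rpart.plot)

tree_rtg <- rpart(RTG ~ coachyrs + coachrate + wrhelp + oline +
                    teamrpg + teampagame + sos,
                  data = qb, method = "anova",           # regression tree
                  control = rpart.control(minsplit = 10))

rpart.plot(tree_rtg, main = "Regression Tree for estimating RTG")
```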

[Figure 9: regression trees for estimating RTG (left) and QBR (right); both split at the root on oline >= 3 (n = 46), with further splits on wrhelp and teampagame.]

Figure 9: Left: regression tree for estimating passer rating (RTG). Right: estimating QBR. Note, the grade scale A-F translates to 1-5; therefore, oline >= 3 being true means quarterbacks with offensive lines of C, D, or F grades.
