Appendix A: More on Defensive Regression (or Runs) Analysis
This appendix has three primary objectives: first, to disclose aspects of DRA not disclosed in chapter two; second, to address aspects of the model that raise issues related less to baseball per se than to statistical modeling in general; and third, to drive home the fundamental point that DRA is not an answer, but a method. Included in this appendix are certain alternative models I tried, and suggestions for further improvements, which should provide some sense of the range of alternative approaches that are possible.

DRA POST-1951

Overview

There are essentially two DRA models: post-1951 and pre-1952. The post-1951 model uses a subset of Retrosheet play-by-play data currently available for seasons after 1951, and was almost completely described in chapter two. The pre-1952 model must make do with considerably less data, which renders it more primitive for infielders and unavoidably more complicated for outfielders.

When we first began explaining DRA, we took a ‘bottom-up’ approach, starting from the shortstop position and gradually building up until we had a team model. Here we’ll take a ‘top-down’ approach, revealing the entire post-1951 team model all at once, and then discussing its components. Likewise, we’ll start with a top-down discussion of the pre-1952 model. The entire post-1951 model is presented below, followed by a glossary of defined terms.

1952–2009 DRA Model

Team defensive runs saved above or below the league rate, given innings pitched, DR.ip, is estimated as the sum of pitching, catching, infield, and outfield defensive runs:

Pitching = .27 * SO.bfp - .34 * BB.bfp - 1.49 * HR.bh + .42 * A1.bip + .44 * IFO.bip - .56 * WP.ip.
Catching = .59 * CS.sba + .59 * GO2.bip.
Infield = .52 * rGO3 + .53 * rA4 + .45 * rA5 + .44 * rA6.
Outfield = .53 * rPO7 + .46 * rPO8 + .44 * rPO9 + .61 * A7.ip + .61 * A8.ip + .61 * A9.ip.

All ‘plain’ variables are team seasonal totals; see the definitions below. All variables with a ‘dot’, for example A6.bip, are calculated in the same way:

A6.bip = A6 - [league A6 * (BIP / league BIP)].

A6.bip equals total A6 recorded by the team above (if negative, below) the league-average rate that year, given total team BIP ‘opportunities’. The ‘opportunities’ variable following the ‘dot’ is always in lower-case letters.

All variables beginning with an “r” are residual team plays that year; that is, estimated net plays taking into account available predictors, using regression analysis.

rGO3 = GO3.bip + .09 * RBIP.bip.
rA4 = A4.bip + .08 * RBIP.bip + .15 * RFO.rbip + .32 * LFO.lbip + .18 * HR.bh + .20 * WP.ip + .19 * SH.bip.
rA6 = A6.bip - .06 * RBIP.bip + .29 * RFO.rbip + .15 * LFO.lbip + .12 * HR.bh + .56 * WP.ip + .43 * SH.bip.
rA5 = A5.bip - .10 * RBIP.bip + .21 * RFO.rbip + .10 * LFO.lbip + .15 * A1.bip + .13 * rGO3 + .13 * IBB.pa.
rPO7 = PO7.bip + .03 * RBIP.bip + .21 * RGO.rbip + .10 * LGO.lbip.
rPO8 = PO8.bip - .01 * RBIP.bip + .27 * RGO.rbip + .24 * LGO.lbip + .07 * IFO.bip + .20 * SH.bip.
rPO9 = PO9.bip - .03 * RBIP.bip + .22 * RGO.rbip + .22 * LGO.lbip + .12 * IFO.bip.

Example of allocation of team fielding runs to individual (lower-case “i”) fielders:

iA6 runs = .44 * rA6 * (iIP / IP) + .44 * [iA6 - A6 * (iIP / IP)].

Definitions of Team-Level Variables for DRA Model (1952–2009)

Abbrev.  Definition                                    Formula or Source
1 … 9    Pitcher ... Right Fielder
A        Assists (total, if not followed by a number)
BB       Unintentional BB + HBP                        UBB + HBP
BFP      Batters Faced by Pitchers                     PA - IBB
BH       Balls Hit                                     BFP - SO - BB
BIP      Balls In Play                                 BH - HR
CS       Caught Stealing
FO       Fly Outs (total)                              RFO + LFO
GO       Ground Outs (total)                           RGO + LGO
GO2      GO at catcher                                 A2 - CS
GO3      GO at first base                              A3 + UGO3
HBP      Hit By Pitch
HR       Home Runs
IA       Infielder-only Assists                        sum(A1, A2, ..., A6)
IBB      Intentional Bases on Balls                    BB - UBB
IFO      Infielder-only FO                             FO - OPO
IP       Innings Pitched (or Played)
IPO      Infielder-only PO                             sum(PO1, PO2, ..., PO6)
LFO      Left-handed batter FO                         play-by-play data
LGO      Left-handed batter GO                         play-by-play data
OA       Outfielder-only A                             sum(A7, A8, A9)
OPO      Outfielder-only PO                            sum(PO7, PO8, PO9)
PA       Plate Appearances
PB       Passed Balls
PO       Putouts (total, if not followed by a number)
RBIP     Right-handed batter BIP                       play-by-play data
RFO      Right-handed batter FO                        play-by-play data
RGO      Right-handed batter GO                        play-by-play data
SBA      Stolen Base (“SB”) Attempts                   SB + CS
SH       Sacrifice Hits
SO       Strikeouts
UBB      Unintentional BB                              BB (traditional) - IBB
UGO3     Unassisted GO3                                avg(UGO3e1, UGO3e2)
UGO3e1   UGO3 estimate #1                              IPO - A - IFO
UGO3e2   UGO3 estimate #2                              GO - IA - CS - GIDP
WP       Wild Pitches (includes PB)                    WP (traditional) + PB

The model and definitions above are a bit much to take in all at once. But I do not believe that any other comprehensive system for team and individual defense remotely as accurate as DRA can be summarized as concisely. Before addressing the new points, let’s quickly recap the basic approach under DRA as described in chapter two. You might find it helpful to refer back to the model and definitions as you read both the recap and the discussion of new issues.

DRA is essentially a forced-zero-intercept, two-stage multivariable least-squares regression analysis model.
I’m using the “two-stage” terminology informally; as we shall see, the DRA model is not an “instrumental variables” model, otherwise known as a “two-stage least-squares” model.

The forced zero intercept merely means that we ‘center’ the ultimate outcome being predicted (team runs allowed), each ‘play made’ outcome (each pitching and fielding ‘play’ that is made) used to predict expected team runs allowed, and each variable used to ‘predict’ expected pitching and fielding plays, so that all outcomes and their respective ‘predictors’ are net numbers, above or below the league-average rate. Furthermore, each outcome or predictor is centered by reference to its appropriate ‘denominator’ of opportunities (the ‘denominators’ are not literally used as denominators in the arithmetical sense; hence the quotation marks).

The first stage of regression analysis involves regressing ‘centered’ fielding variables ‘onto’ centered variables not under the control of the fielding position being evaluated (and ideally not influenced by the quality of other fielders) that tend to be associated with more or fewer fielding plays at that position. The residual left over from each first-stage regression at each position is treated as an estimate of the ‘skill’ plays made at that position above or below expectation.

The second-stage regression involves regressing net team runs allowed onto net pitching and (first-stage-regression-adjusted) fielding plays in order to reveal the number of runs associated with each net pitching and (first-stage-regression-adjusted) fielding outcome.

To rate a team at a position, you simply apply the run weight determined in the second-stage regression to the net plays (which, again, are negative half the time) to determine defensive runs at that position.
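
As a rough illustration of the two stages just described, the following Python sketch uses a single made-up predictor and five made-up team-seasons; the real DRA model is multivariable and fitted to many seasons of data. The helper name fit_through_origin, the variable names, and every number are assumptions chosen so the arithmetic is easy to check by hand. The sign flip that turns a runs-allowed coefficient into a runs-saved weight is one plausible reading of the description, not a procedure stated in the text.

```python
def fit_through_origin(x, y):
    """Least-squares slope for y ~ b * x with the intercept forced to zero."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# Hypothetical centered team-season values (all net of the league rate).
context = [10.0, -10.0, 20.0, -20.0, 0.0]  # net predictor outside the fielder's control
a6_net = [5.0, -1.0, 5.0, -7.0, -2.0]      # net shortstop assists (like A6.bip)
ra_net = [-1.0, -1.0, 0.5, 0.5, 1.0]       # net team runs allowed (like RA.ip)

# Stage 1: regress net plays onto the context variable; the residual is the
# estimate of 'skill' plays above or below expectation (like rA6).
b1 = fit_through_origin(context, a6_net)
r_a6 = [a - b1 * c for a, c in zip(a6_net, context)]

# Stage 2: regress net runs allowed onto the residual plays. More plays made
# go with fewer runs allowed, so the slope is negative; negating it gives a
# runs-saved weight per net play (assumed sign convention).
b2 = fit_through_origin(r_a6, ra_net)
run_weight = -b2

# Team defensive runs at the position: run weight times net skill plays.
defensive_runs = [run_weight * r for r in r_a6]

print(round(b1, 3), round(run_weight, 3))
print([round(d, 3) for d in defensive_runs])
```

With these toy inputs the first-stage slope is .3 and the run weight is .5 runs per net play. A full treatment would fit all predictors at a position jointly in stage one, and all pitching and fielding outcomes jointly in stage two.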
Finally, you allocate team defensive runs at that position to each player, first pro rata, based on his innings played at that position; you then calculate his net plays compared to the team rate, given his percentage of team innings played. Each net play is credited with the same run weight used for the team rating at that position.

Centering The Variables By Their Respective ‘Denominators’

We center all the team variables by their respective ‘denominators’ of opportunities. Centering in this way is the first step towards making each variable less correlated with the others, so that its ‘independent’ net impact in runs may be better estimated. The little quotation marks are there because we will not achieve true independence in a mathematically precise sense.

The best ‘denominator’ of opportunities for the ultimate outcome we’re trying to model (actual total team runs allowed per season) is innings pitched, so we calculate team runs allowed above or below the league-average rate given the team’s innings played, that is, net runs allowed given innings played, or RA.ip. In some sense this is just denominating net runs allowed by total outs, as innings are defined by outs. This is correct, because the ultimate limit on the number of runs a team can score in an inning is defined by outs.
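
The centering step can be sketched in a few lines of Python. The function name center and all of the totals below are hypothetical, chosen only to make the arithmetic easy to follow. The calculation itself follows the description above: the team total minus what the league-average rate would produce given the team's opportunities.

```python
def center(team_total, league_total, team_opps, league_opps):
    """Team total above (+) or below (-) the league-average rate,
    given the team's share of league 'opportunities'."""
    expected = league_total * (team_opps / league_opps)
    return team_total - expected


# Hypothetical season: the team allowed 660 runs in 1,450 innings, while the
# league as a whole allowed 11,200 runs in 23,200 innings.
ra_ip = center(660, 11200, 1450, 23200)
print(ra_ip)  # -40.0, i.e. 40 fewer runs allowed than the league rate implies

# The same function serves any 'dot' variable, e.g. shortstop assists given
# balls in play (again with made-up totals).
a6_bip = center(500, 7200, 4400, 70400)
print(a6_bip)  # 50.0
```

Note that the ‘denominator’ enters only through the team's share of league opportunities, which is why it is not a denominator in the literal arithmetical sense: the result stays in units of runs or plays, not a rate.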