
Product Recall Notice for ‘Shot Quality’

Data quality problems with the measurement of the quality of a hockey team’s shots taken and allowed


Introduction

In 2004 I outlined a method for the measurement of “Shot Quality” 1. This paper presents a cautionary note on the calculation of shot quality.

Several individuals have enhanced my approach, or expressed the outcome differently, but the fundamental method to determine shot quality remains unchanged:

1. For each shot taken/allowed, collect the data concerning circumstances (e.g. shot distance, shot type, situation, rebound, etc.) and outcome (goal, save).

2. Build a model of goal probabilities that relies on the circumstance. Current best practice is that of Ken Krzywicki involving binary logistic regression techniques 2. I now use a slight variation on that theme that ensures that the model reproduces the actual number of goals for each shot type and situation.

3. For each shot apply the model to the shot data to determine its goal probability.

4. Expected Goals: EG = the sum of the goal probabilities for each shot for a given team.

5. Neutralize the variation in shots on goal by calculating Normalized Expected Goals: NEG = EG x ShotsL / Shots (ShotsL = league average shots for/against).

6. Shot Quality: SQ = NEG / GAL (GAL = league average goals for/against).

I have called the result SQA for shots against and SQF for shots for. I normally talk about a team’s overall shot quality, but one can determine shot quality for any subset of its total shots (e.g. individual goaltenders, individual shooters, short-handed play, Tuesday nights, etc.).

The way to interpret an SQA of 1.050 is that the quality of shots allowed by the team is such that it would result in 5% more goals than a team with an SQA of 1.000, all other things being equal. An SQF of 1.05 means that the quality of shots taken by the team is such that it would result in 5% more goals than a team with an SQF of 1.000, all other things being equal.
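To make the bookkeeping in steps 4 through 6 concrete, here is a minimal sketch in Python. The shot-level table layout and the column names (team, goal_prob) are my assumptions, and the goal probability model from step 2 is taken as given; this is an illustration of the arithmetic, not a reference implementation.

    # Minimal sketch of steps 4-6. Assumes one row per shot with a modelled
    # goal probability already attached (step 3).
    import pandas as pd

    def shot_quality(shots: pd.DataFrame, league_avg_shots: float,
                     league_avg_goals: float) -> pd.Series:
        # Step 4: Expected Goals = sum of goal probabilities, by team
        eg = shots.groupby("team")["goal_prob"].sum()
        # Shots on goal by team
        sog = shots.groupby("team").size()
        # Step 5: Normalized Expected Goals = EG x ShotsL / Shots
        neg = eg * league_avg_shots / sog
        # Step 6: Shot Quality index = NEG / GAL
        return neg / league_avg_goals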

1 See http://www.hockeyanalytics.com/Research_files/Shot_Quality.pdf

2 See http://www.hockeyanalytics.com/Research_files/Shot_Quality_Krzywicki.pdf


Goal Prevention

SQA gives us a measure of shot quality that is independent of the number of shots on goal and independent of goaltending.

Before shot quality measurement was conceived goal prevention looked like a simple model:

GA = SOG x (1 – SV)

We knew that Shots on Goal (SOG) was a defensive responsibility, so we have historically attributed it to the skaters on a team. We knew that Save Percentage (SV) was principally a goaltending statistic, so we have historically attributed it to goaltenders. But shot quality gives us a better model:

GA = SQA x SOG x (1 – SQNSV) where SQA is the Shot Quality Allowed Index and SQNSV is the Shot Quality Neutral Save Percentage.

To calculate SQNSV observe that the two models both give us goals against. That means:

SOG x (1 – SV) = SQA x SOG x (1 – SQNSV), or (1 – SV) = SQA x (1 – SQNSV), which means SQNSV = 1 – (1 – SV) / SQA

In this model we attribute both SQA and SOG to the defense and SQNSV to goaltending. Clearly SQNSV is a better measure of the goaltender’s contribution to team success than is SV. It represents the save percentage one would expect with no variation in shot quality from team to team.
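A minimal sketch of these identities, with illustrative (made-up) numbers; the function names are mine, the formulas are the ones above.

    def sqnsv(sv: float, sqa: float) -> float:
        # Shot Quality Neutral Save Percentage: SQNSV = 1 - (1 - SV) / SQA
        return 1.0 - (1.0 - sv) / sqa

    def goals_against(sog: float, sqa: float, sv: float) -> float:
        # GA = SQA x SOG x (1 - SQNSV), which reduces to SOG x (1 - SV)
        return sqa * sog * (1.0 - sqnsv(sv, sqa))

    # A .910 goaltender behind a defense that allows shots 5% more dangerous
    # than average (SQA = 1.05) is actually saving at a .914 clip on a
    # shot-quality-neutral basis.
    print(round(sqnsv(0.910, 1.05), 3))   # 0.914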

As SQA and SOG are attributed to defense, it sometimes makes sense to combine them as SQNSOG (Shot Quality Neutral Shots on Goal), a kind of ‘wind chill’ calculation for goaltenders.

Goal Scoring

The same logic applies to our understanding of offense:

GF = SQF x SOG x SQNSH, where SQNSH = SH / SQF

SH = Shooting Percentage, but the interpretation here is different. On offense, SOG describes the ‘Decision’ to shoot, SQF describes the ‘Circumstances’ of the shot, and SQNSH (Shot Quality Neutral Shooting Percentage) describes the ‘Execution’ of the shot attempt.
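The offensive identity can be sketched the same way (again with made-up numbers):

    def sqnsh(sh: float, sqf: float) -> float:
        # Shot Quality Neutral Shooting Percentage: SQNSH = SH / SQF
        return sh / sqf

    def goals_for(sog: float, sqf: float, sh: float) -> float:
        # GF = SQF x SOG x SQNSH, which reduces to SOG x SH
        return sqf * sog * sqnsh(sh, sqf)

    # A team shooting 9.5% while generating slightly below-average chances
    # (SQF = 0.95) is executing at 10.0% on a shot-quality-neutral basis.
    print(round(sqnsh(0.095, 0.95), 3))   # 0.1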


SQF is a strange beast. My research has indicated that GF is highly correlated to SOG and SQNSH but nearly uncorrelated with SQF. The only way this makes sense to me is that, on offense, the circumstances of a shot are much less impactful than execution. Note that this is not the case with SQA, which explains a fair bit of goal prevention.

The Problem

I have been mainly interested in SQA. Since the time I first started looking at shot quality, the New York Rangers have led the league. Always. And this has concerned me. Greatly.

The problem, you see, is the way that the data is collected. In my Shot Quality paper I made the following observation:

The source of these game summaries is the NHL’s Real Time Scoring System (RTSS). RTSS scorers have a tough job to do, recording each on ice “event” and player ice time. When it comes to a shot, the scorer records the shooter, the distance and the shot type by tapping several times on a screen. The time is recorded by the system based on one of these taps. Distance is captured by a tap on a screen resembling the rink. The system calculates the distance.

All of this happens pretty quickly. The highest priority is the shooter, as this data does get summarized and published. In the heat of battle, it is easy to get the time, shot type and distance wrong. The database clearly has embedded errors. There are shots that are impossibly close together in time. There are “wrap” shots from 60 feet. There are “tip in” shots from 60 feet. There are likely to be other coding errors (slap shots coded as wrist shots). It is easy to imagine that the record of distance is off, at least by small amounts. It is easy to imagine that a “snap” shot and a “wrist” shot are frequently confused. It is easy to imagine that two different scorers would give us two different records of the same event.
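The embedded errors described above can be screened for mechanically. Here is a rough sketch of such a screen; the column names (game_id, period, time_s, shot_type, distance_ft) and the thresholds are my assumptions about an RTSS extract, not its actual schema.

    import pandas as pd

    def flag_suspect_shots(shots: pd.DataFrame) -> pd.DataFrame:
        shots = shots.sort_values(["game_id", "period", "time_s"]).copy()
        # Shots impossibly close together in time within a game and period
        gap = shots.groupby(["game_id", "period"])["time_s"].diff()
        shots["too_close"] = gap.notna() & (gap < 2)   # 2 seconds, illustrative
        # Shot types that make little sense at long range
        long_range = shots["distance_ft"] >= 50        # illustrative cut-off
        shots["odd_type"] = long_range & shots["shot_type"].isin(["Wrap", "Tip-In"])
        return shots[shots["too_close"] | shots["odd_type"]]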

I have been worried that there is a systemic bias in the data. Random errors don’t concern me. They even out over large volumes of data. I seriously doubt that the RTSS scorers bias the shot data in favour of the home team. But I do think that it is a serious possibility that the scoring in certain rinks has a bias towards longer or shorter shots, the most dominant factor in a shot quality model. And I set out to investigate that possibility.

The Study

In statistics ‘paired’ data is a powerful thing. Conduct an experiment and measure its results. Change one variable, repeat the experiment and, voilà, you have paired data.

Thanks to Ken Krzywicki’s tireless efforts, his 2006-07 (‘2007’) shot data set captures the home and road team. If we study shot quality for home and road games separately, we have paired data and should be able to identify any systemic bias in the measurement of shot quality.


Below is a summary of the home and road SQAs of NHL teams in the 2007 season. I have sorted the data from the highest to lowest differential.

Shot Quality Allowed (SQA), Home vs Road, 2006-07

Team   Home    Road    Difference
NYR    1.314   0.982    0.332
CAR    1.122   0.984    0.138
SJS    1.063   0.928    0.135
CBJ    1.032   0.956    0.075
WAS    1.030   0.961    0.069
STL    1.141   1.074    0.066
PHO    1.024   0.959    0.065
PIT    1.062   1.011    0.052
ANA    1.064   1.022    0.041
OTT    1.043   1.009    0.034
MIN    0.954   0.928    0.026
NYI    1.037   1.012    0.026
DET    1.041   1.016    0.025
FLA    1.038   1.014    0.024
CHI    1.064   1.059    0.004
VAN    0.990   0.988    0.002
NAS    1.020   1.019    0.001
CAL    0.886   0.910   -0.024
PHI    0.987   1.017   -0.030
BOS    0.973   1.007   -0.034
MTL    0.943   0.990   -0.046
EDM    0.917   0.968   -0.050
TOR    0.948   1.010   -0.062
NJD    0.879   0.945   -0.065
ATL    1.003   1.070   -0.067
DAL    0.922   0.992   -0.070
COL    0.960   1.031   -0.071
BUF    0.865   0.949   -0.084
LAK    0.940   1.054   -0.114
TBL    0.854   1.042   -0.188

And sure enough we have a smoking gun. On the road the Rangers look like an average defensive team with an SQA factor of .982. But at home the Blueshirts look awful with an SQA of 1.314. It is difficult to believe that the Rangers allow shots that are 33% more dangerous at home. It is more likely that RTSS scorers in the Big Apple have a certain myopia that causes them to record shot distances too short.

Without the help of statistical inference techniques it is difficult to know whether this is a meaningful deviation. A crude test is to see if any of the data lies outside of two standard deviations of the mean. The standard deviation of this difference data is 0.094. This means that the Rangers data is very suspicious. It also means that the data from the left coast of Florida is potentially problematic as well.

Upon further inspection we can see that the standard deviation of road SQAs is 0.043 whereas the standard deviation of home SQAs is 0.093. This is further evidence of a reporting bias.

When one studies measurement errors in an unbiased measurement process the errors have a normal distribution. So the next test we would want to apply is a test of normality to the difference column. The Anderson-Darling test of normality tells us that this data has about a 7% chance of being normal. This is not conclusive but suggests that there is something non-random in the home-road differentials.
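Both checks are easy to reproduce. The sketch below uses the SQA home-road differences from the table above (hand-copied) and assumes numpy and scipy are available; note that scipy's Anderson-Darling routine reports the test statistic and critical values rather than the p-value quoted above.

    import numpy as np
    from scipy import stats

    # Home-road SQA differences, copied from the table above (NYR first, TBL last)
    diff = np.array([
        0.332, 0.138, 0.135, 0.075, 0.069, 0.066, 0.065, 0.052, 0.041, 0.034,
        0.026, 0.026, 0.025, 0.024, 0.004, 0.002, 0.001, -0.024, -0.030, -0.034,
        -0.046, -0.050, -0.062, -0.065, -0.067, -0.070, -0.071, -0.084, -0.114,
        -0.188,
    ])

    # Crude test: anything more than two standard deviations from the mean?
    sd = diff.std(ddof=1)                      # ~0.094, as quoted above
    z = (diff - diff.mean()) / sd
    print(np.where(np.abs(z) > 2)[0])          # flags NYR and (just barely) TBL

    # Anderson-Darling test of normality on the differences
    ad = stats.anderson(diff, dist="norm")
    print(ad.statistic, ad.critical_values, ad.significance_level)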


Below is a summary of the home and road SQFs of NHL teams in the 2007 season. I have again sorted the data from the highest to lowest differential.

Shot Quality For (SQF), Home vs Road, 2006-07

Team   Home    Road    Difference
NYR    1.377   0.969    0.408
SJS    1.189   1.043    0.146
STL    1.097   0.978    0.119
CAR    1.071   0.952    0.118
PHO    0.971   0.918    0.053
MIN    0.941   0.891    0.050
ANA    1.079   1.041    0.037
PIT    1.121   1.099    0.022
EDM    0.895   0.896   -0.002
CAL    1.023   1.026   -0.003
WAS    0.973   0.979   -0.006
BOS    0.984   0.990   -0.006
FLA    0.884   0.896   -0.012
DET    0.984   1.000   -0.016
OTT    0.991   1.007   -0.016
CHI    0.882   0.899   -0.017
COL    1.097   1.115   -0.018
CBJ    0.985   1.021   -0.036
NJD    0.923   0.964   -0.041
NAS    0.974   1.022   -0.049
PHI    0.981   1.035   -0.054
TOR    0.882   0.951   -0.068
MTL    0.962   1.034   -0.073
VAN    0.936   1.022   -0.086
LAK    0.935   1.021   -0.086
ATL    1.034   1.120   -0.086
NYI    0.968   1.063   -0.095
TBL    0.940   1.052   -0.112
DAL    0.870   0.999   -0.129
BUF    0.922   1.095   -0.172

And there is the same smoking gun. On the road the Rangers look like a below average offensive team with an SQF factor of .969. Yet at home New York looks superhuman with an SQF of 1.377. It is very difficult to believe that the Rangers take shots that are 41% more dangerous at home. And, again, it is more likely that RTSS scorers in the Big Apple have myopia (recording shot distances too short).

San Jose ranked third on the SQA list and second here. Carolina was second on the SQA list and fourth here. Tampa Bay was last on the SQA list and third last here. Buffalo flipped this profile (last on this list and third last in SQA). The correlation of the rankings is quite high.

The standard deviation of the difference data is 0.107. This means that the Rangers data is very suspicious. The standard deviation of road SQFs is 0.064 whereas the standard deviation of home SQFs is 0.106. The evidence of a reporting bias continues to mount. The Anderson-Darling test of normality, applied to the difference column, tells us that this data has about a 0.2% chance of being normal, the New York outlier being the principal problem. The case is closed.

Combining the Data

Each of the analyses above looks only at one end of the rink. If we add the home-road differences for SQA and SQF we can get a sense of the systemic bias.

That data is presented below and it looks systemically biased to the naked eye. The Anderson-Darling test of normality tells us that this data has about a 0.9% chance of being normal.

Reporting Difference in Offensive and Defensive Shot Quality, 2006-07

Team    SQA      SQF      Total
NYR     0.332    0.408    0.740
SJS     0.135    0.146    0.281
CAR     0.138    0.118    0.256
STL     0.066    0.119    0.186
PHO     0.065    0.053    0.118
ANA     0.041    0.037    0.079
MIN     0.026    0.050    0.076
PIT     0.052    0.022    0.074
WAS     0.069   -0.006    0.063
CBJ     0.075   -0.036    0.039
OTT     0.034   -0.016    0.017
FLA     0.024   -0.012    0.012
DET     0.025   -0.016    0.009
CHI     0.004   -0.017   -0.013
CAL    -0.024   -0.003   -0.027
BOS    -0.034   -0.006   -0.040
NAS     0.001   -0.049   -0.047
EDM    -0.050   -0.002   -0.052
NYI     0.026   -0.095   -0.069
VAN     0.002   -0.086   -0.084
PHI    -0.030   -0.054   -0.084
COL    -0.071   -0.018   -0.088
NJD    -0.065   -0.041   -0.106
MTL    -0.046   -0.073   -0.119
TOR    -0.062   -0.068   -0.131
ATL    -0.067   -0.086   -0.154
DAL    -0.070   -0.129   -0.199
LAK    -0.114   -0.086   -0.200
BUF    -0.084   -0.172   -0.256
TBL    -0.188   -0.112   -0.300

Notice that the signs of the SQA and SQF columns match almost uniformly in the top and bottom 10 teams. This is a vivid picture of reporting bias.

There is clearly a large problem here. There is a good chance that the reporting for up to two thirds of the teams is materially off in its measurement of shot distance or type.


The principal driver of a shot quality model is shot distance (a shorter shot having a greater chance of becoming a goal). Let’s have a look at shot distance (for both home and visiting teams) for certain reporting cities:

Average Shot Distance (feet) by Reporting City, 2006-07

Reporting City    Even Handed   Power Play   Short Handed   Overall
NY Rangers        25.9          24.0         30.5           25.6
San Jose          30.8          28.4         43.3           30.8
Carolina          32.2          33.4         33.0           32.5
St. Louis         30.6          29.8         38.8           30.7
League Average    34.1          32.8         43.9           34.2
Dallas            36.5          35.8         39.1           36.5
Los Angeles       35.7          35.6         38.0           35.7
Buffalo           34.8          33.5         45.6           35.0
Tampa Bay         33.6          33.4         35.4           33.6
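A table like this is straightforward to produce from the shot-level data. The sketch below assumes hypothetical column names (reporting_city, situation, distance_ft) in the RTSS extract.

    import pandas as pd

    def distance_by_city(shots: pd.DataFrame) -> pd.DataFrame:
        # Average shot distance by reporting city and situation, plus an
        # overall column across all situations
        table = shots.pivot_table(index="reporting_city", columns="situation",
                                  values="distance_ft", aggfunc="mean")
        table["Overall"] = shots.groupby("reporting_city")["distance_ft"].mean()
        return table.round(1)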

It is easy to see why the shot quality reported out of Madison Square Garden is so high.

The second largest influence on shot quality is situation. A team that both takes and draws a considerable number of penalties will record shot quality, at both ends of the rink, which is higher than average (and vice versa). It is unlikely that situation is incorrectly reported in the RTSS data. And we have no way of verifying this information in any case.

The last driver of shot quality is shot type. For a shot of a given distance, a robust shot quality model will assign differing goal probabilities to the shot types tip-in, slap, snap, wrist, backhand and wrap (listed in declining order of danger). I think that wrist and snap shots are easily interchanged. A wrap shot is really just a circumstance of a wrist or snap shot and could easily be misreported. I suspect that tip-ins are more aggressively reported by some scorers and less so by others. I have not investigated the potential for misreporting shot type.

Shot distance certainly seems to be part of the problem of reporting bias. Shot type may also be a source of error. For the moment the source of the errors does not really matter.


The Implications

There are two standard uses of shot quality models.

The first is to properly balance our view of the relative impact of defense and goaltending. Without adjustment for the reporting effects we would conclude that goaltenders in places like Manhattan, Carolina and San Jose are hard done by their woeful defense, and we would make an incorrect adjustment to get to their shot quality neutral save percentages in an attempt to compensate. Likewise we would conclude that goaltenders in places like Tampa Bay and Los Angeles are benefiting from superior defense, and dial down their SQNSVs, when in fact they are not so fortunate. Below is an indication of the potential impact of this on selected goalies, using road SQAs rather than overall SQAs:

Save Percentages

Goaltender         Team   Unadjusted   SQNSV (overall SQA)   SQNSV (road SQA)
Henrik Lundqvist   NYR    .917         .927                  .915
Evgeni Nabokov     SJS    .914         .914                  .907
Cam Ward           CAR    .897         .902                  .893
Ryan Miller        BUF    .911         .902                  .906
Johan Holmqvist    TBL    .893         .888                  .898
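The adjustment is easy to check by hand. The snippet below reworks the Lundqvist row using the SQNSV formula from the Goal Prevention section; his overall SQA is not listed in this paper, so it is approximated here as the midpoint of the Rangers' home (1.314) and road (0.982) figures.

    sv = 0.917                          # unadjusted save percentage
    sqa_road = 0.982                    # from the SQA table above
    sqa_overall = (1.314 + 0.982) / 2   # rough approximation, ~1.148

    print(round(1 - (1 - sv) / sqa_overall, 3))   # ~0.928, close to the .927 above
    print(round(1 - (1 - sv) / sqa_road, 3))      # 0.915, matching the table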

Before this work I was getting ready to crown Lundqvist as the unheralded king of the crease …

The other frequent use is to assess expected versus actual offense. Based on the SQF reporting discrepancies we would have excessive offensive expectations for teams such as the Rangers, Sharks, Blues and Hurricanes and understated expectations of teams such as the Sabres, Stars and Lightning.

The Fix

I always felt that there was substantial variation in the reporting of shots, especially distance. But I took comfort in one of the laws of large numbers – errors average out.

But we clearly have an issue of RTSS scorer bias. The problem appears to be brutal at Madison Square Garden but is clearly non-trivial elsewhere. The NHL needs to take a serious look at the consistency of this process.

Clearly the use of ‘overall’ SQ factors, heavily polluted by ‘rink effects’, has the potential to mislead. One obvious solution is the use of road factors only (as I have done above). This has the benefit of bringing back into play the comfort that ‘errors average out’ in a large, diverse data set. Road SQA and SQF factors for 2007 are detailed above.

The problem with the simple solution is best illustrated by example. If we use road factors only, the Rangers are never influenced by Madison Square Bias. But the Islanders, Devils, Flyers and Penguins each play about 10% of their road games in Manhattan. Each of these teams has a dose of Madison Square Bias in their data, overstating goal probabilities at both ends. These teams have more of this than other teams in the East and, of course, way more than the teams of the West.

Using road SQ factors is much better than using overall factors. But the best solution is to solve for rink-neutral SQA and SQF factors. This involves solving a system of simultaneous equations.
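The paper does not spell out that system of equations, but one plausible formulation (mine, not necessarily the author's) treats each observed factor as the product of a true team factor and a rink reporting factor, takes logs, and solves the resulting linear system by least squares:

    import numpy as np

    def rink_neutral_factors(observations):
        # observations: list of (team_index, rink_index, observed_sq) tuples,
        # one per team-rink combination with enough shots to be credible
        teams = 1 + max(t for t, _, _ in observations)
        rinks = 1 + max(r for _, r, _ in observations)
        A = np.zeros((len(observations) + 1, teams + rinks))
        b = np.zeros(len(observations) + 1)
        for i, (t, r, sq) in enumerate(observations):
            A[i, t] = 1.0            # log of the team's true factor
            A[i, teams + r] = 1.0    # log of the rink's reporting factor
            b[i] = np.log(sq)
        # Identifiability: rink reporting factors average to 1 (logs sum to 0)
        A[-1, teams:] = 1.0
        coef, *_ = np.linalg.lstsq(A, b, rcond=None)
        return np.exp(coef[:teams]), np.exp(coef[teams:])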

Shot Quality Recall

It’s not really a recall. Shot quality is not broken. Just don’t use it without understanding it. Use the road factors!

Shot quality is a very powerful tool. Like any tool the user needs to understand (a) how to use it and (b) its limitations. My method for assessing shot quality relies on the underlying data. There is considerable potential here for “garbage in, garbage out”. But, properly used, it remains a powerful tool.

And the Devils really are (almost) that good.

Copyright Alan Ryder, 2007 Hockey Analytics www.HockeyAnalytics.com