Skill, Luck and Hot Hands on the PGA Tour

June 21, 2005

Robert A. Connolly
Kenan-Flagler Business School
CB3490, McColl Building
UNC - Chapel Hill
Chapel Hill, NC 27599-3490
(919) 962-0053 (phone)
(919) 962-5539 (fax)
[email protected]

Richard J. Rendleman, Jr.
Kenan-Flagler Business School
CB3490, McColl Building
UNC - Chapel Hill
Chapel Hill, NC 27599-3490
(919) 962-3188 (phone)
(919) 962-2068 (fax)
[email protected]

We thank Tom Gresik and seminar participants at the University of Notre Dame for helpful comments. We provide special thanks to Carl Ackermann and David Ravenscraft who provided significant input and assistance during the early stages of this study. We also thank Mustafa Gultekin for computational assistance and Yuedong Wang and Douglas Bates for assistance in formulating, programming and testing portions of our estimation procedures. Please direct all correspondence to Richard Rendleman.


1. INTRODUCTION

Like all sports, outcomes in golf involve elements of both skill and luck. Perhaps the highest level of skill in golf is displayed on the PGA Tour. Even among these highly skilled players, however, a small portion of each 18-hole score can be attributed to luck, or what players and commentators often refer to as "good and bad breaks." The purpose of our work is to determine the extent to which skill and luck combine to determine 18-hole scores in PGA Tour events. We are also interested in whether PGA Tour players experience "hot or cold hands," or runs of exceptionally good or bad scores relative to those predicted by their statistically estimated skill levels.

From a psychological standpoint, understanding the extent to which luck plays a role in determining 18-hole golf scores is important. Clearly, a player would not want to make swing changes to correct abnormally high scores that were due, primarily, to bad luck. Similarly, a player who shoots a low round should not get discouraged if he cannot sustain that level of play, especially if good luck was the primary reason for his score. At the same time, it is important for a player to know whether his general (mean) skill level is improving or deteriorating over time and to understand that deviations from past scoring patterns may not be due to luck alone.

From a policy standpoint, it seems reasonable that the PGA Tour and other professional golf tournament organizations should attempt to minimize the role of luck in determining tournament outcomes and qualification for play in Tour events. Ideally, the PGA Tour should be comprised of the most highly skilled players. Also, the Tour should strive to conduct tournaments that reward the most highly skilled players rather than those who experience the greatest luck.

In some cases, luck in a round of golf can easily be identified.
In the final round of the 2002 Bay Hill Invitational, David Duval's approach shot to the 16th hole hit the pin and bounced back into a water hazard fronting the green. Duval took a nine on the hole. Few would argue that Duval's score of nine was due to bad judgment or a sudden change in his skill level, and it is highly unlikely that Duval made swing changes or changes to his general approach to the game to correct the type of problem he incurred on the 16th hole.

In contrast, consider the good fortune experienced by Craig Perks in the final round of the 2002 Players Championship, when he chipped in on holes 16 and 18 and sank a 28-foot putt for birdie on the 17th hole en route to victory. Was Perks' victory, his first on the PGA Tour, a reflection of exceptional ability or four consecutive days of good luck? The fact that his best season since 2002 placed him 146th on the PGA Tour's Official Money List suggests that luck may have been the overriding factor. Similarly, as of June 2005, Ben Curtis has had only one top-10 finish since winning the 2003 British Open. At this point, few would argue that Curtis's victory was anything other than a fluke.

In many situations, specific occurrences of good and bad luck may not be as obvious as in the examples above. Luck simply occurs in small quantities as part of the game. Even the most highly skilled players cannot produce perfect swings on every shot. As a result, a certain human element of luck is introduced to a shot even before contact with the ball is made. (An excellent article on the role of luck in golf, with a focus on how the ultimate outcome of the duel between Tiger Woods and Chris DiMarco in the 2005 Masters was determined as much by luck as by skill, is provided in Jenkins (2005).)

Our work draws on the rich literature in sports statistics. Klaassen and Magnus (2001) model the probability of winning a tennis point as a function of player quality, context (situational) variables, and a random component.
They note that failing to account for quality differences will create pseudo-dependence in scoring outcomes, because winning reflects, in part, player quality, and player quality is generally persistent. Parameter estimates from their dynamic, panel random effects model using four years of data from Wimbledon matches suggest that the iid hypothesis for tennis points is a good approximation, provided there are controls for player quality. The most important advantage of their approach from our perspective is that it supports a decomposition of actual performance into two parts: an expected score based on skill and an unexpected residual portion.

In this paper, we decompose individual golfers' scores into skill-based and unexpected components by using Wang's (1998) smoothing spline model to estimate each player's mean skill level as a function of time while simultaneously estimating the correlation in the random error structure of fitted player scores and the relative difficulty of each round. The fitted spline values of this model provide estimates of expected player scores. We define luck as deviations from these estimated scores, whether positive or negative, and explore its statistical properties. Our tests show that after adjusting a player's score for his general skill level and the relative difficulty of the course on the day a round is played, the average (and median) standard deviation of residual golf scores on the PGA Tour is approximately 2.7 strokes per 18-hole round, ranging between 2.1 and 3.5 strokes per round per player. We find clear evidence of positive first-order autocorrelation in the error structure for over 13 percent of the golfers and, therefore, conclude that a significant number of PGA Tour players experience hot and cold hands. We also apply some traditional hot hands tests to the autocorrelated residuals and come to similar conclusions. However, after removing the effects of first-order autocorrelation from the residual scores, we find little additional evidence of hot hands. This suggests that any statistical study of sporting performance should estimate skill dynamics and deviations from normal performance while simultaneously accounting for the relative difficulty of the task and the potential autocorrelation in unexpected performance outcomes.

The remainder of the paper is organized as follows. In Section 2 we describe our data and criteria for including players in our statistical samples. In Section 3 we present the results of a number of fixed-effects model specifications to identify the variables that are important in predicting player performance. In Section 4 we formulate and test the cubic spline-based model for estimating player skill. This model becomes the basis for our analysis of player skill, luck and hot hands summarized in Section 5. A final section provides a summary and concluding comments.

2. DATA We have collected individual 18-hole scores for every player in every stroke-play PGA Tour event for years 1998-2001 for a total of 76,456 scores distributed among 1,405 players. Our data include all stroke play events for which participants receive credit for earning official PGA Tour money, even though some of the events, including all four “majors,” are not actually run by the PGA Tour. The data were collected, primarily, from www.pgatour.com, www.golfweek.com, www.golfonline.com, www.golfnews.augustachronicle.com, www.insidetheropes.com, and www.golftoday.com. When we were unable to obtain all necessary data from these sources, we checked national and local newspapers, and in some instances, contacted tournament headquarters directly. Our data cover scores of players who made and missed cuts. (Although there are a few exceptions, after the second round of a typical PGA Tour event, the field is reduced to the 70 players, including ties, with the lowest total scores after the first two rounds.) The data also include scores for players who withdrew from tournaments and who were disqualified; as long as we have a score, we use it. We also gathered data on where each round was played. This is especially important for

tournaments such as the Bob Hope Chrysler Classic and AT&T Pebble Beach National Pro-Am played on more than one course.

The great majority of the players represented in the sample are not regular members of the PGA Tour. Nine players in the sample recorded only one 18-hole score over the 1998-2001 period, and 565 players recorded only two scores. Most of these 565 players qualified to play in a single US Open, British Open or PGA Championship, missed the cut after the first two rounds and subsequently played in no other PGA Tour events. As illustrated in the top panel of Figure 1, 1,069 players, 76.1 percent of the players in the sample, recorded 50 or fewer 18-hole scores. As illustrated in the bottom panel, this same group recorded 5,895 scores, which amounts to only 7.7 percent of the total number of scores in the sample. 1,162 players, representing 82.7 percent of the players in the sample, recorded 100 or fewer scores. The total number of scores recorded by this group was 12,849, or 16.8 percent of the sample. The greatest number of scores recorded by a single player was 457.

The players who recorded 50 or fewer scores represent a mix of established senior players such as Larry Nelson (17 scores), relatively inactive middle-aged players such as Bobby Clampett (16), "old-timers," up-and-coming stars, established European Tour players such as Robert Karlsson (46), and a large number of players that most readers of this paper, including those who follow the PGA Tour, have probably never heard of. Clearly, this group, which accounts for 76.1 percent of the players in the sample, but only 7.7 percent of the scores, is not representative of typical PGA Tour participants. Therefore, including them in the sample could cause distortions in estimating the statistical properties of golf scores on the Tour.
In estimating the skill levels of individual players, we employ the smoothing spline model of Wang (1998), which adjusts for correlation in random errors. Simulations by Wang indicate that 50 sample observations is probably too small, and that approximately 100 or more observations are required to obtain dependable cubic-spline-based mean estimates. After examining player names and the number of rounds recorded by each player, we have concluded that the sample of players who recorded more than 90 scores is reasonably homogeneous and likely to meet the minimum sample size requirements of the cubic spline methodology.
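The sample restriction can be sketched as follows; the player labels, scores and the `restrict_sample` helper are hypothetical, and the paper's actual dataset has 76,456 scores for 1,405 players:

```python
from collections import Counter

def restrict_sample(scores, min_rounds=91):
    """Keep only scores of players with more than 90 recorded rounds.

    scores: list of (player, score) tuples. Returns the filtered list.
    """
    counts = Counter(player for player, _ in scores)
    return [(p, s) for p, s in scores if counts[p] >= min_rounds]

# Hypothetical example: player "A" has 2 rounds, player "B" has 95.
data = [("A", 72), ("A", 74)] + [("B", 70 + i % 3) for i in range(95)]
kept = restrict_sample(data)
assert {p for p, _ in kept} == {"B"}
```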


3. MODEL IDENTIFICATION

The model used in our tests, based on Wang's smoothing spline, is computationally intensive and requires approximately a full day to run on a 3.19 GHz PC operating under Windows XP. Because of its computational requirements, we employ two simpler models to identify the relevant variables and appropriate model form for our tests. In model 1, we employ a fixed-effects multiple regression to predict a player's 18-hole score for a given round as a function of the level of skill displayed by the player throughout the entire 1998-2001 period and the relative difficulty of each round. In model 2, each player's skill level is allowed to change (approximately) by calendar year, but otherwise the model is identical to model 1. We show that including approximate calendar year-based time dependency significantly improves the predictive power of the model.

We also estimate various versions of model 2 to determine whether alternative model specifications might be warranted. Using model 2, we show that the best way to estimate the relative difficulty of a given golf round is through a round-course interaction dummy variable. Although most tournaments on the PGA Tour are played on a single course, several are played on more than one course. Using round-course interactions to predict the relative difficulty of each round is significantly more powerful than employing a dummy for the round alone (without interacting with the course), the course alone (without interacting with the round), or the tournament alone (without interacting with either the course or the round). We also show that individual player performance is not affected by the courses on which he plays.

3.1. Model 1

Let $s_{i,j}$ denote the 18-hole score of player $i$ (with $i = 1, \ldots, n$, ordered alphabetically) in PGA Tour round-course interaction $j$ (with $j = 1, \ldots, m$). A round-course interaction is defined as the interaction between a regular 18-hole round of play in a specific tournament and the course on which the round is played. For most tournaments, only one course is used and, therefore, there is only one such interaction per round. However, over the sample period, 25 of 182 tournaments were played on more than one course. For example, in the Bob Hope Chrysler Classic, the first four rounds are played on four different courses using a rotation that assigns each tournament participant to each of the four courses over the first four days. A cut is made after the fourth round, and a final round is played the

fifth day on a single course. Thus, the Bob Hope tournament consists of 17 round-course interactions: four for each of the first four days of play and one additional interaction for the fifth and final day. According to model 1, the predicted score for player $i$ in connection with round-course interaction $j$ is determined by the following fixed-effects multiple regression model:

$$s_{i,j} = \alpha + \sum_{k=2}^{n} \beta_k p_k + \sum_{c=2}^{m} \gamma_c r_c + \varepsilon_{i,j} \qquad (1)$$

In (1), $p_k$ is a dummy variable that takes on a value of 1 if player $i = k$ and zero otherwise,

$r_c$ is a dummy that takes on a value of 1 if round-course interaction $j = c$ and zero otherwise, and

$\varepsilon_{i,j}$ is an error term with $E(\varepsilon_{i,j}) = 0$. In the model, the first round of the 1998 Mercedes Championships, played on a single course, is round-course interaction $j = 1$. Therefore, the regression intercept, $\alpha$, represents the expected score of the first player ($k = 1$) in the first round of the 1998 Mercedes Championships, $\beta_k$ represents the differential amount that player $k > 1$ would be expected to shoot in the first round of the 1998 Mercedes, and $\gamma_c$ denotes the additional amount that all players, including the first player, would be expected to shoot in connection with any other round-course combination.

The only restriction placed on the data is that a player must have recorded at least four 18-hole scores, the minimum number required for the regression to be of full rank, to be included in the regression estimate. With this restriction, the data include 75,054 scores for 810 players recorded over a possible 848 round-course interactions, although no single player participated in more than 457 rounds. Berry (2001) employs a similarly-constructed random effects model that takes into account the skill level of each player and the intrinsic difficulty of each round to measure the performance of Tiger Woods relative to others on the PGA Tour. However, Berry's model does not take account of the potential for a given round to be played on more than one course.

It should be noted that when estimating round-course interaction coefficients, no specific information about course conditions, adverse weather, etc. is taken into account. Nevertheless, if such conditions combine to produce abnormal scores in a given 18-hole round, the effects of these conditions should be reflected in the estimated coefficients. Under model 1, the highest round-course interaction coefficient is associated with the portion of the third round of the 1999 AT&T Pebble Beach National Pro-Am played on the Pebble Beach course.
For this particular round-course interaction, wind and rain combined to make the famed course extremely difficult and forced tournament officials

to cancel the event during the fourth round due to unplayable conditions. (Our final analysis shows that the Pebble Beach course played 5.8 strokes more difficult on the third day than on the first two days.)
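The fixed-effects regression in (1) can be sketched with a small numpy example. The players, round-course interactions and coefficient values below are hypothetical (the actual estimation uses 75,054 scores and 848 interactions); with noiseless toy data, least squares recovers the fixed effects exactly:

```python
import numpy as np

# Hypothetical small instance of model (1): n = 3 players, m = 4
# round-course interactions. Player 1 in interaction 1 is the baseline.
n, m = 3, 4
alpha, beta, gamma = 71.0, {2: 1.5, 3: -0.5}, {2: 2.0, 3: -1.0, 4: 0.5}

rows, y = [], []
for i in range(1, n + 1):          # every player plays every interaction
    for j in range(1, m + 1):
        x = [1.0]                                                # intercept (alpha)
        x += [1.0 if i == k else 0.0 for k in range(2, n + 1)]   # player dummies p_k
        x += [1.0 if j == c else 0.0 for c in range(2, m + 1)]   # round-course dummies r_c
        rows.append(x)
        y.append(alpha + beta.get(i, 0.0) + gamma.get(j, 0.0))   # noiseless score

X, y = np.array(rows), np.array(y)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# With noiseless data the fixed effects are recovered exactly.
assert np.allclose(coef, [71.0, 1.5, -0.5, 2.0, -1.0, 0.5])
```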

3.2. Model 2

In the previous model, a player's skill coefficient is assumed to be constant throughout the entire 1998-2001 period. In model 2, a player's skill coefficient is allowed to vary through time on an approximate calendar year basis. In principle, we wish to estimate a single regression equation with separate skill coefficients estimated for each player in each calendar year. However, experiments with a model of this form reveal that a number of players whose scores are included in the estimation of model 1 would have to be removed from the original sample for the regression to be of full rank. To overcome the potential problem of singularity, while maintaining the same sample employed in the estimation of model 1, we employ the following method for allowing an individual player's skill coefficient to vary through time. If a full calendar year has passed, at least 25 scores were used in the estimation of the previous skill coefficient for the same player, and at least 25 additional scores were recorded for the same player, then a new incremental skill coefficient for that player is estimated. For players who participated actively in all four years of the sample, this procedure results in the estimation of different mean skill levels in each of the four years. Although the 25-score criterion is somewhat arbitrary, it is not critical, since the model we will actually use for estimating player scores is based on Wang's smoothing spline methodology and does not use the 25-score criterion.
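Our reading of the 25-score rule can be sketched as follows; `coefficient_periods`, the 365-day year check, and the forward-looking count of remaining scores are illustrative simplifications, not the authors' exact procedure:

```python
from datetime import date, timedelta

def coefficient_periods(round_dates, min_scores=25):
    """Assign each round (dates in chronological order) to a skill-coefficient
    period. A new incremental coefficient starts once (a) a full calendar year
    has passed since the current period began, (b) the current period already
    contains at least min_scores rounds, and (c) at least min_scores rounds
    remain to estimate the new coefficient."""
    periods, period = [], 1
    start, count_in_period = round_dates[0], 0
    for idx, d in enumerate(round_dates):
        remaining = len(round_dates) - idx
        if ((d - start).days >= 365
                and count_in_period >= min_scores
                and remaining >= min_scores):
            period += 1
            start, count_in_period = d, 0
        count_in_period += 1
        periods.append(period)
    return periods

# Hypothetical schedule: 30 weekly rounds in 1998, then 30 in 1999.
y1 = [date(1998, 1, 1) + timedelta(days=7 * i) for i in range(30)]
y2 = [date(1999, 1, 7) + timedelta(days=7 * i) for i in range(30)]
periods = coefficient_periods(y1 + y2)
assert periods == [1] * 30 + [2] * 30   # one new coefficient in year two
```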

As before, let $s_{i,j}$ denote the 18-hole score of player $i$ in connection with PGA Tour round-course interaction $j$. Define $p_{k,t(k)}$ as a dummy variable that takes on a value of 1 if player $i = k$ and the player-specific (approximate) calendar year-based time period equals or exceeds $t(k) = 1, \ldots, T(k)$, and zero otherwise, where $T(k)$ denotes the total number of player-specific (approximate) calendar year-based time periods attributable to player $k$. Thus, if $i = k$, $p_{k,1} = 1$ in all time periods, $p_{k,2} = 1$ in periods 2 through $T(k)$ but zero otherwise, $p_{k,3} = 1$ in periods 3 through $T(k)$ but zero otherwise,


and $p_{k,4} = 1$ in period 4 only. Finally, let $\beta_{k,t(k)}$ denote the value of incremental skill coefficient $t(k)$ for player $k$. Then, the basic version of model 2, denoted as 2.1, is specified as follows:

$$s_{i,j} = \alpha + \sum_{t(k)=2}^{T(k)} \beta_{k=1,t(k)}\, p_{k=1,t(k)} + \sum_{k=2}^{n} \sum_{t(k)=1}^{T(k)} \beta_{k,t(k)}\, p_{k,t(k)} + \sum_{c=2}^{m} \gamma_c r_c + \varepsilon_{i,j} \qquad (2.1)$$

As in model 1, the first player in the sample is player $k = 1$, and the first round of the 1998 Mercedes Championships is round-course combination $j = 1$. Therefore, the regression intercept, $\alpha$, represents the first player's expected score in the first round of the 1998 Mercedes Championships.

The term $\sum_{t(k)=2}^{T(k)} \beta_{k=1,t(k)}\, p_{k=1,t(k)}$ picks up potential incremental changes in skill for the first player starting in his second estimation period, typically 1999. In all forms of model 2, the skill coefficient $\beta_{k,1}$ is estimated in each of the $T(k)$ estimation periods for all players $k > 1$. Starting in the second period, the coefficient $\beta_{k,2}$ is estimated for all players $k = 1, \ldots, n$ for whom $T(k) > 1$, and so on.

To determine the best way to estimate the relative difficulty of each round, we estimate three additional forms of model 2, denoted as models 2.2 through 2.4, respectively. In model 2.2 we substitute 728 round dummies for the 848 round-course interaction dummies in model 2.1. We define a "round" as a round of play in a specific tournament, regardless of whether the round is played on more than one course. Therefore, this specification ignores differences in scores for the same round that are played on different courses. In model 2.3 we substitute 182 tournament dummies for the round-course interaction dummies in model 2.1. Each tournament dummy denotes a particular tournament played in a given year. For example, there are four tournament dummies associated with the Masters, played in each of the four years 1998-2001. According to this specification, the relative difficulty of each round is determined by the tournament and does not vary from day to day within the tournament or over multiple courses that might be used for the tournament. Finally, in model 2.4 we substitute 77 course dummies, which indicate the courses on which the rounds are played, for the round-course interaction dummies in model 2.1. As such, this specification assumes that the relative difficulty of a round is determined strictly by the course on which the round is played and that the difficulty of the course does not change from day to day within a tournament or even from year to year.
Table 1 summarizes the performance of these models using both a full sample of players (75,054 observations) and a sample restricted to the players who recorded more than 90 scores (64,364 observations).


Using the full sample, the adjusted $R^2$ is 0.2996 for model 1 and 0.3077 for model 2.1. The F-statistic for testing the difference between model 2.1 and model 1 is 2.44, which, given the large sample size, is significant at any normal testing level. This indicates that there is added value to including (approximate) calendar year-based time variation in mean player skill levels. F-statistics for testing model 2.1 against alternative specifications using dummies for rounds, tournaments and courses, rather than for round-course interactions, are 5.66, 9.66 and 10.67, respectively, indicating that model 2.1, which uses round-course interactions, is superior. Very similar results are obtained for the same tests using the restricted sample. It should be noted that Berry's (2001) model for predicting player scores is equivalent to our model 1 using random effects for actual rounds rather than fixed effects for round-course interactions. The results in Table 1 indicate that including calendar year-based time variation in player skill levels and adjusting scores for the relative difficulty of round-course interactions is a superior model specification.

We also attempted to estimate a fifth version of model 2 to test whether there is added value to including interactions among players and courses. With the full sample of players, the number of coefficients to be estimated increases from 2,224 using model 2.1 to 17,201 when player-course interactions are included. With the sample restricted to only those players who competed in more than 90 rounds, the number of coefficients increases from 848 to 13,383. Unfortunately, the regressions are of such large scale that we were unable to obtain sufficient resources to estimate them on the university's statistical mainframe computer. Therefore, we address the issue of player-course interactions by including course dummies in random effects analyses of individual player residual scores from our final spline-based estimation model.
These tests, described in Section 4.1, show no evidence of significant player-course interaction effects. Also, tests of player performance in adjacent rounds of the same tournament, summarized in Table 5 and discussed in Section 5, provide additional evidence of no significant effects.
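The model comparisons above use the standard F-test for nested regressions. A minimal sketch, with hypothetical RSS values and parameter counts (not the paper's actual figures), illustrates why even a small RSS reduction can be significant when the error degrees of freedom are in the tens of thousands:

```python
import numpy as np

def nested_f(rss_restricted, rss_full, df_restricted, df_full, n_obs):
    """F-statistic for comparing nested linear models.

    df_* are the numbers of estimated parameters; the statistic is
    F = ((RSS_r - RSS_f) / (df_f - df_r)) / (RSS_f / (n - df_f)).
    """
    num = (rss_restricted - rss_full) / (df_full - df_restricted)
    den = rss_full / (n_obs - df_full)
    return num / den

# Hypothetical magnitudes for a sample of 75,054 scores.
f = nested_f(rss_restricted=520_000.0, rss_full=514_000.0,
             df_restricted=1_000, df_full=1_800, n_obs=75_054)
assert f > 1.0   # would then be compared against an F critical value
```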

4. THE SPLINE-BASED ESTIMATION MODEL

Based on the previous analysis, we conclude that the best way to adjust player scores for the relative difficulty of each round is to use round-course interactions rather than rounds (ignoring courses), courses (ignoring rounds) or tournaments (ignoring both rounds and courses). We also conclude that our estimation model should reflect time-dependent estimates of mean player skill. However, we recognize that allowing player skill to change only by calendar year is somewhat

arbitrary. Therefore, in model 3 we employ a more general specification of time dependency in mean player skill that does not require skill changes to occur only at the turn of the year or at any other pre-specified points in time. In model 3 we estimate each player's mean skill level as a function of "golf time" using the restricted maximum likelihood (REML) version of Wang's (1998) smoothing spline model, which adjusts for correlated random errors. Player $k$'s "golf time" counts successive competitive PGA Tour rounds of golf played by player $k$, regardless of how these rounds are sequenced in actual calendar time. Inasmuch as there are likely to be gaps in actual time between some adjacent points in golf time, it is unlikely that random errors around individual player spline fits follow higher-order autoregressive processes. At the same time, however, we do not want to rule out the possibility that the random errors may be correlated. Model 3, which uses Wang's REML model for estimating mean individual player skill as a function of golf time, is formulated as follows. Define player $k$'s golf time as the sequence

$g_k = 1, \ldots, G(k)$, and let $g_k(j)$ denote the mapping of the sequence $j = 1, \ldots, m$ to the sequence $g_k = 1, \ldots, G(k)$ such that the sequence $g_k(j)$ represents the "golf time" of all round-course interactions, $j$, for which player $k$ recorded an 18-hole score. For example, assume that player $k$'s first three scores were recorded in connection with round-course interactions 6, 7, and 16. Then, $g_k(6) = 1$, $g_k(7) = 2$ and $g_k(16) = 3$. With this mapping, model 3 becomes:

$$s_{i,j} = \sum_{k=1}^{n} \left( f_k[g_k(j)] + \theta_k[g_k(j)] \right) p_k + \sum_{c=2}^{m} (\gamma_c + \xi_c)\, r_c + \kappa_{i,j}$$

$$= \sum_{k=1}^{n} \left( f_k[g_k(j)] + \theta_k[g_k(j)] \right) p_k + \sum_{c=2}^{m} \omega_c r_c + \kappa_{i,j},$$

with $\omega_c = \gamma_c + \xi_c$ and $E(\kappa_{i,j}) = 0$. In model 3, $f_k[g_k(j)]$ is the smoothing spline function applied to player $k$'s golf scores over golf time $g_k = 1, \ldots, G(k)$. $\theta_k[g_k(j)]$ is the random error associated with the spline fit for player $k$ evaluated with respect to round-course interaction $j$ as mapped into golf time $g_k(j)$, with $\theta_k = \left( \theta_k[1], \ldots, \theta_k[G(k)] \right)' \sim N\!\left(0, \sigma_k^2 W_k^{-1}\right)$ and $\sigma_k^2$ unknown. Note that the intercept is absorbed by the $f_k[g_k(j)]$ terms.
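The golf-time mapping $g_k(j)$ amounts to renumbering the round-course interactions a player actually played. A minimal sketch, using the paper's own example of interactions 6, 7 and 16 (`golf_time` is an illustrative helper, not the authors' code):

```python
def golf_time(round_course_ids):
    """Map a player's round-course interaction ids (in the order played)
    to consecutive "golf time" indices 1, ..., G(k)."""
    return {j: g for g, j in enumerate(round_course_ids, start=1)}

# First three scores recorded in interactions 6, 7 and 16.
g_k = golf_time([6, 7, 16])
assert g_k == {6: 1, 7: 2, 16: 3}
```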


It is convenient to re-express $\theta_k[g_k(j)]$ as $\theta_k[g_k(j)] = \varphi_k[g_k(j)] + \eta_k[g_k(j)]$, where $\varphi_k[g_k(j)]$ represents the predicted error associated with player $k$'s spline fit as of golf time $g_k(j)$ as a function of past errors, $\theta_k[1], \ldots, \theta_k[g_k(j) - 1]$, and $\eta_k[g_k(j)]$ denotes the remaining residual error. Substituting $\theta_k[g_k(j)] = \varphi_k[g_k(j)] + \eta_k[g_k(j)]$, model 3 becomes:

$$s_{i,j} = \sum_{k=1}^{n} \left( f_k[g_k(j)] + \varphi_k[g_k(j)] + \eta_k[g_k(j)] \right) p_k + \sum_{c=2}^{m} \omega_c r_c + \kappa_{i,j} \qquad (3)$$

It is important to recognize that in estimating model 3, the autocorrelation in random errors about each spline fit is not removed. Instead, the spline fits and the autocorrelation parameters associated with the random error around the spline fits are estimated simultaneously. Model 3 represents a generalized additive model with random effects $\eta = (\eta_1, \ldots, \eta_n)'$ and $\xi = (\xi_1, \ldots, \xi_m)'$ associated with players $i = 1, \ldots, n$ and round-course interactions $j = 1, \ldots, m$, respectively. Although the smoothing spline functions allow for general autocorrelation structures in $\theta_i$, we assume that the $\theta_i$ follow player-specific AR(1) processes. The methodology for estimating model 3 is described in the appendix.
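To make the error decomposition concrete, here is a minimal numpy sketch of splitting AR(1) residuals $\theta$ into predicted ($\varphi$) and unpredictable ($\eta$) parts, together with a Ljung-Box Q statistic of the kind used as a white-noise check below. The data are simulated and $\phi$ is supplied rather than estimated jointly with a spline, so this illustrates only the error structure, not the authors' estimation procedure:

```python
import numpy as np

def decompose_ar1(theta, phi):
    """Split total residuals theta into the AR(1) one-step prediction
    (varphi = phi * lagged theta) and the remaining residual eta."""
    theta = np.asarray(theta, dtype=float)
    varphi = np.zeros_like(theta)      # no prior error for the first round
    varphi[1:] = phi * theta[:-1]
    return varphi, theta - varphi

def ljung_box_q(x, lags):
    """Ljung-Box Q = n(n+2) * sum_k rho_k^2 / (n - k) for a series x."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n, denom = len(x), np.sum(x * x)
    q = sum(
        (np.sum(x[k:] * x[:-k]) / denom) ** 2 / (n - k)
        for k in range(1, lags + 1)
    )
    return n * (n + 2) * q

# Simulate AR(1) residuals: theta[t] = 0.3 * theta[t-1] + eps[t].
rng = np.random.default_rng(0)
eps = rng.normal(size=200)
theta = np.empty(200)
theta[0] = eps[0]
for t in range(1, 200):
    theta[t] = 0.3 * theta[t - 1] + eps[t]

varphi, eta = decompose_ar1(theta, phi=0.3)
assert np.allclose(eta, eps)           # eta recovers the innovations exactly
q = ljung_box_q(eta, lags=10)          # eta should then look like white noise
```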

4.1. General Properties of Model 3

Throughout the remainder of the paper, frequent reference is made to two different types of residual errors from the spline fits of model 3. To avoid confusion, we define the two residual error types here. In model 3, $\theta_i[g_i(j)]$ is the total residual error associated with spline fit $i$ as of golf time $g_i(j)$. $\varphi_i[g_i(j)]$ represents the predicted error associated with player $i$'s spline fit as of golf time $g_i(j)$ as a function of past errors, $\theta_i[1], \ldots, \theta_i[g_i(j) - 1]$, and $\eta_i[g_i(j)]$ represents the error component of the spline fit that is not predictable from past residual errors. We assume that the error $\theta_i[g_i(j)]$ follows an AR(1) process with first-order autocorrelation coefficient $\phi_i$. The spline fit $f_i$ and correlation $\phi_i$ are estimated simultaneously using Wang's (1998) smoothing spline methodology. For some of our tests we focus on residual errors $\eta_i[g_i(j)]$, which we refer to as $\eta$ errors. If the assumed AR(1) process properly captures the correlation structures of residual player scores, $\eta$ errors should represent white noise. For other tests, we focus on autocorrelated residual errors $\theta_i[g_i(j)]$, which we refer to as $\theta$ errors.

In formulating model 3, we assume that an AR(1) process describes the residual error structure about the cubic spline fit for each player. If this is a valid assumption, the $\eta$ errors for each player should be serially uncorrelated. However, if it is not, there may be additional autocorrelation in the $\eta$ errors. To check for possible model misspecification in connection with the AR(1) assumption, we estimated autocorrelation coefficients of order 1-5 on the $\eta$ errors of each player. In no instance were any estimated coefficients statistically significant at the 5 percent level. In addition, we computed the Ljung-Box (1978) Q statistic associated with the $\eta$ errors for each player for lags equal to the minimum of 10 or 5 percent of the number of rounds played. Ljung (1986) suggests that no more than 10 lags should be used for the test, and Burns (2002) suggests that the number of lags should not exceed 5 percent of the length of the series. Only 6 of 253 Q statistics were significant at the 5 percent level. As a final diagnostic, we ran random effects models relating $\theta$ and $\eta$ errors to course dummy variables for each of the 253 players, assigning a dummy to a course if a player played the course at least twice, the minimum required for the model to be of full rank. None of the F-tests associated with the 253 random effects tests were significant at the 5 percent level for either type of error. Therefore, we conclude that the autocorrelation properties of the spline function have been properly specified and that it is unnecessary to include player-course interactions in the estimation model.

Figure 2 shows plots of spline fits for four selected players.
All figures in the upper panel show 18-hole scores reduced by the effects of round-course interactions (connected by jagged lines) along with corresponding spline fits (smooth lines). The same spline fits (smooth lines) are shown in the lower panel along with predicted scores adjusted for round-course interactions (connected by jagged lines), computed as a function of prior residual errors, $\theta_i[g_i(j)]$, and the first-order autocorrelation coefficient, $\phi_i$, estimated in connection with the spline fit. The scale of each plot is different. Therefore, the spline fits in the lower panel appear as stretched versions of the corresponding fits in the upper panel and, visually, plots for the four players are not directly comparable.


The two plots for Chris Smith reflect 10.7067 degrees of freedom used in connection with the estimate of his spline function, the largest degrees-of-freedom value among the 253 players. In contrast, the spline fit for Tiger Woods, estimated with 2.0011 degrees of freedom, is more typical; almost 71 percent of the 253 spline fits use 2.25 or fewer degrees of freedom and have the same general appearance as that estimated for Woods. The fit for Leggatt reflects $\phi = -0.2791$, the most negative first-order autocorrelation coefficient estimated in connection with the 253 splines. Also, the standard deviation of the $\theta$ residual errors around the spline fit for Leggatt is 2.14 strokes per round, the lowest among the 253 players. Interestingly, after adjusting for prior $\theta$ errors and first-order autocorrelation, $\phi$, the predicted score for one of the four players, shown as the jagged line in the lower panel, is the lowest among all 253 players at the end of the 1998-2001 sample period.

Figure 3 shows six histograms that help to summarize the 253 cubic spline-based mean player skill functions. The first histogram shows the degrees of freedom for the various spline fits. Three of the fits have exactly two degrees of freedom, while the degrees of freedom for 179, or 71 percent, of the splines are less than 2.25. This implies that a large majority of the spline fits are essentially linear functions of time, such as that illustrated in Figure 2 for Tiger Woods. (It should be noted that for each spline, an additional degree of freedom, not accounted for in the histograms, is used up in estimating the AR(1) correlation structure of the residuals.) Fifty-three of the spline fits have three or more degrees of freedom, implying a departure from linearity. Twelve of the splines have five or more degrees of freedom, and as noted earlier, the largest number of degrees of freedom is 10.71 for Chris Smith.
The histogram in the lower left-hand panel of Figure 3 shows the distribution of first-order autocorrelation coefficients estimated in connection with the 253 splines. The residual θ errors of 158, or 62 percent, of the spline fits are positively correlated. We employ the bootstrap method to test the significance of individual player spline fits against alternative specifications of player skill. To maintain consistency in our testing methods, we apply the same set of bootstrap samples to test the significance levels of the first-order autocorrelation coefficients estimated in connection with each fit. All bootstrap tests are based on balanced sampling of 40 samples per player. Wang and Wahba (1995) describe how the bootstrap method can be used in connection with smoothing splines that are estimated without taking into account the autocorrelation in residual errors. We modify the method outlined in Wang and Wahba so that the bootstrap samples are based on η residuals, which adjust predicted scores for autocorrelation in prior θ residuals.

Forty bootstrap samples is the minimum necessary to estimate two-sided 95 percent confidence intervals. Although 40 samples is well below the number required to estimate precise confidence intervals for each individual player, a total of 40 × 253 = 10,120 bootstrap samples are taken over all 253 players, requiring over three days of computation time on a 3.19 GHz computer running Windows XP. The 10,120 total bootstrap samples should be more than sufficient to draw general inferences about statistical significance within the overall population of PGA Tour players. After sorting the correlation coefficients computed via the 40 bootstrap samples from lowest to highest, the number significantly negative is the number of players for which the 39th correlation coefficient is negative, and the number significantly positive is the number of players for which the second correlation coefficient is positive. We find that six of 95 negative autocorrelation coefficients are significant at the 5 percent level, while 32 of 158 positive coefficients are significant. Although the implications of positive autocorrelation are developed in more detail later, based on the number of players displaying significant positive autocorrelation, it is clear that a substantial number of players exhibit hot and cold hands in their golfing performance over the sample period.

The four remaining histograms in Figure 3 summarize bootstrap tests of the 253 cubic spline fits against the following alternative methods of estimating a player's 18-hole score (after subtracting the effects of round-course interactions):

1. The player's mean score.
2. The player's mean score in each approximate calendar year period as defined by model 2.
3. The player's score estimated as a linear function of time.
4. The player's score estimated as a quadratic function of time.
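The significance rule for the bootstrapped autocorrelation coefficients (39th-smallest coefficient negative, or 2nd-smallest positive, out of 40 sorted values) can be sketched as follows; the function name and interface are ours, not the paper's:

```python
def autocorr_significance(boot_phis):
    # boot_phis: the 40 bootstrapped first-order autocorrelation
    # coefficients for one player, in any order. With 40 sorted values,
    # the 2nd and 39th order statistics bracket an approximate two-sided
    # 95 percent confidence interval.
    s = sorted(boot_phis)
    if s[38] < 0:      # 39th smallest is negative: interval lies below zero
        return "significantly negative"
    if s[1] > 0:       # 2nd smallest is positive: interval lies above zero
        return "significantly positive"
    return "not significant"
```

Applied player by player, this reproduces the counting rule described above: a player is tallied as significantly negative (or positive) only when nearly all of his 40 bootstrap coefficients fall on one side of zero.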

For each of the four tests we form a test statistic,

ζ̂ = RSE(alt)/(G − df_alt) − RSE(spline)/(G − df_spline − 1),

where G is the number of 18-hole scores for a given player, df_alt and df_spline are the numbers of degrees of freedom associated with the estimation of the alternative and spline models, respectively, and RSE(alt) and RSE(spline) are the total residual squared errors from the alternative and spline models. We subtract 1 in the denominator of the second term to account for the additional degree of freedom associated with the estimation of the first-order autocorrelation coefficient. For the purposes of this test, RSE(spline) is based on η errors, since these errors reflect the complete information set derived from each spline fit. The test statistic ζ̂ is suggested by Efron and Tibshirani (1998, pp. 190-192 and p. 200 [problem 14.12]) for testing the predictive power of an estimation model formulated with two different sets of independent variables. The test statistic is computed for each of 40 bootstrap samples per player. In a one-sided test of whether the spline fit is a superior specification to the alternative model, ζ̂ should be positive in 38 or more of the 40 bootstrap samples.

The first line of Table 2 shows that the spline model is significantly superior (at the 5 percent level) to the player's mean score for approximately 30 percent of the 253 players in the sample. It is significantly superior to a linear time trend for 8 percent of the players, to the mean score computed for each (approximate) calendar year for 5 percent, and to a quadratic time trend for 5 percent of the players. The second line of the table shows that the percentage of players for which the alternative model is generally superior (but not statistically superior) is 30 percent and 17 percent for the linear and quadratic time trends, respectively, and only 8 percent and 4 percent for the mean score and mean score per (approximate) calendar year. (For a given player, the alternative model is generally superior if ζ̂ is positive for fewer than 20 of 40 bootstrap samples.) Although the spline model is superior to the alternative models at the 5 percent level for a relatively small percentage of players, it "beats" the alternative models a much larger proportion of the time than would be predicted by chance. Therefore, we conclude that the spline model is superior to any of the four alternative specifications.
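The test statistic and the one-sided bootstrap decision rule can be sketched as follows; this is a minimal illustration under the definitions above, with function names of our own choosing:

```python
def zeta_hat(rse_alt, df_alt, rse_spline, df_spline, G):
    # zeta_hat = RSE(alt)/(G - df_alt) - RSE(spline)/(G - df_spline - 1).
    # The extra 1 in the spline denominator accounts for the degree of
    # freedom used to estimate the AR(1) coefficient. Positive values
    # favor the spline model over the alternative.
    return rse_alt / (G - df_alt) - rse_spline / (G - df_spline - 1)

def spline_significantly_superior(zetas):
    # One-sided 5 percent test over 40 bootstrap samples: the spline
    # fit is judged superior if zeta_hat is positive in 38 or more.
    assert len(zetas) == 40
    return sum(z > 0 for z in zetas) >= 38
```

In use, `zeta_hat` would be evaluated once per bootstrap sample from that sample's residual squared errors, and the resulting 40 values passed to `spline_significantly_superior`.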

5. PLAYER PERFORMANCE

5.1 Skill

Based on bootstrap sampling, the preceding analysis shows that approximately 30 percent of cubic spline-based player score estimates are significantly superior at the 5 percent level to estimates based solely on mean scores. Moreover, the mean squared residual error for each of the original cubic spline fits, measured as RSE(spline)/(G − df_spline − 1), is less than the mean squared residual about the mean, measured as RSE(mean)/(G − 1), for 70 percent of the players in the sample. These results provide strong evidence that the skill levels of PGA Tour players change through time. For many players, such as Tiger Woods, the relationship between the player's average skill level and time is well approximated by a linear time trend, after adjusting for autocorrelation in residual errors. For others, such as Chris Smith, Ian Leggatt and Bob Estes, the relationship between mean player skill and time is more complex and cannot easily be modeled by a simple parametric time relationship. The player-specific cubic spline functions can provide point estimates of expected scores, adjusted for round-course interactions, at the end of the 1998-2001 sample period, and, therefore, can be used to rank the players at the end of 2001. Table 3 provides a summary of the best players among the sample of 253 as of the end of 2001 based on the cubic spline point estimates shown in column 1 of the table. Column 2 shows estimates of player scores after adjusting for autocorrelation in θ residual errors around each spline fit. The values in column 1 can be thought of as estimates of mean player skill at the end of the sample period. In contrast, the values in column 2 can be thought of as estimates of each player's last score as a function of his ending mean skill level and the correlation in random errors about prior mean skill estimates. The Official World Golf Ranking is also shown for each player as of 11/04/01, the ranking date that corresponds to the end of the official 2001 PGA Tour season. The World Golf Ranking is based on a player's most recent rolling two years of performance, with points awarded based on position of finish in qualifying worldwide golf events on nine different tours. The ranking does not reflect actual player scores. Details of the ranking methodology can be found in the "about" section of the Official World Golf Ranking web site, www.officialworldgolfranking.com. Despite being based on two entirely different criteria, the two ranking methods produce similar lists of players.
Only one player in the top 10 of the Official World Golf Ranking (its number 10) as of 11/04/01 does not appear in Table 3 (he is actually 21st according to our ranking), and only five players listed in Table 3 were not ranked among the Official World Golf Ranking's top 20. Among the 20 players listed in Table 3, the predicted score for Bob Estes, adjusted for autocorrelation in residual errors (column 2), is the lowest. Although it may come as a surprise that a player with as little name recognition as Estes would be predicted to shoot the lowest score at the end of 2001, the plots of Estes' spline function in Figure 2 show why: Estes exhibited marked improvement over the last quarter of the sample period, and the improvement was sufficiently pronounced that his estimated spline function picked it up.


Table 4 lists the 10 players showing the greatest improvement and the 10 showing the greatest deterioration in skill over the 1998-2001 sample period, based on differences between beginning and ending spline-based estimates of player scores adjusted for round-course interactions. Cameron Beckman was the most improved player, improving by 3.36 strokes from the beginning of 1998 to the end of 2001. Chris DiMarco and Mike Weir, relatively unknown in 1998 but now recognized among the world's elite players, were the fifth and sixth most improved players over the sample period. The 10 golfers whose predicted scores went up the most included Fuzzy Zoeller (2.99 strokes, born 1951), Keith Fergus (2.85 strokes, 1954), Bobby Wadkins (2.74 strokes, 1951), and David Edwards (2.32 strokes, 1956); the largest deterioration, 8.18 strokes, was recorded by a player born in 1949, and two others on the list (2.65 and 2.64 strokes) were born in 1953 and 1949. Clearly, deterioration in player skill appears to be a function of age, with a substantial amount of deterioration occurring during a golfer's mid-to-late 40s. But despite the natural deterioration that occurs with age, 133 of the 253 players in the sample actually improved from the beginning of the sample period to the end.

5.2 Luck

To what extent does luck play a role in determining success or failure on the PGA Tour? Based on the premise that our model for predicting 18-hole scores is correct, luck represents deviations between actual and predicted 18-hole scores, either positive or negative, that are not sustainable from one round to the next. Even if a golfer plays at an unusually high skill level in a given round and cannot point to any specific instances of good luck to explain his performance, we would consider him to have been lucky if his high skill level cannot be sustained. In this view, luck is a random variable with temporal independence.

5.2.1 Average Residual Scores in Adjacent Rounds. In the analysis that follows, we test whether deviations between actual and predicted 18-hole scores are sustainable from one round to the next by examining whether players who record exceptionally good or bad scores in one round of a tournament can be expected to continue their exceptional performance in subsequent rounds of the same tournament. For this test we examine the performance of players in all adjacent tournament rounds that are not divided by a cut. A typical PGA Tour tournament involves four rounds of play, with a cut being made after the second round. Therefore, for a typical tournament, we compare performance between rounds 1 and 2 and also between rounds 3 and 4. We do not compare performance between rounds 2 and 3, however, because approximately half of the players who participate in round 2 are cut and do not continue for a third round. For the Bob Hope Chrysler Classic, which involves five rounds of play with a cut being made after the fourth round, we compare performance between rounds 1 and 2, rounds 2 and 3, and rounds 3 and 4, but we do not compare performance between rounds 4 and 5. A similar procedure is employed for other tournaments that do not employ a cut after the second round.

For the purposes of this test, we sort all θ and η residual scores in the first of each pair of qualifying rounds and place each player into one of 20 categories of approximately equal size based on the ranking of residuals in the first of each qualifying two-round pair. Within each sort category we compute the average residual score in both the first and second of the two adjacent rounds. Table 5 summarizes the results of this test for both θ and η residuals. The section of the table that summarizes the analysis for θ residuals shows a very slight tendency for scores in the 20 sort categories to carry over from one round to the next. In sort categories 1-10, the average first-round θ residual score is negative, and the average second-round θ residual is also negative in each of these categories. For example, on average, players in sort category 1 have a first-round θ residual score of -5.184, meaning that they scored 5.184 strokes better (lower) than predicted by model 3. If a portion of the 5.184 strokes represents a change in skill for these players, or an advantage due to some players having a competitive advantage on the course they played in the first round, rather than random good luck, scores in the next round should continue to be lower than predicted. Otherwise, scores of players in the first sort group should revert back to normal. In the next adjacent round (not divided by a cut), the average θ residual for these same players is -0.097.
This same pattern, negative average residual first-round scores followed by negative average residual second-round scores, is present throughout the first ten sort categories. However, all of the average second-round residuals are very close to zero, with the largest, in absolute value, being 0.135 strokes for sort category 2. To shed further light on the relationship between residual scores in adjacent rounds, we run a simple least squares regression within each sort group in which the second residual in each pair is regressed against the first. If players within a group tend to continue with similar performance, this regression coefficient should be positive and significant. However, as shown in Table 5, there is no discernible pattern among the regression coefficients within the first ten sort groups, and none of the regressions is significant.
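The group-sorting and within-group regression just described can be sketched with NumPy; the function name, interface, and use of `polyfit` for the OLS slope are our own illustrative choices:

```python
import numpy as np

def carryover_by_group(first, second, n_groups=20):
    # first, second: numpy arrays of paired residual scores from adjacent
    # rounds not divided by a cut. Players are sorted into n_groups
    # categories of approximately equal size by the first-round residual;
    # within each group we report the average first- and second-round
    # residuals and the OLS slope from regressing the second on the first.
    order = np.argsort(first)
    results = []
    for idx in np.array_split(order, n_groups):
        x, y = first[idx], second[idx]
        slope = np.polyfit(x, y, 1)[0]  # least squares slope
        results.append((x.mean(), y.mean(), slope))
    return results
```

If performance carried over fully, second-round group means would track first-round means and the slopes would be near one; pure luck would drive second-round means and slopes toward zero, which is the pattern the paper reports.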


Among sort groups 11-20, all average first-round θ residuals are positive, and seven of ten average second-round residuals are positive. But except in sort categories 19 and 20, with average second-round residuals of 0.219 and 0.328, respectively, the average second-round residuals would be of little if any significance in golf. One possible explanation for the different pattern of average residuals in categories 19 and 20, and the significant positive regression coefficient in sort group 20, is that players who have performed poorly prior to a cut may take more chances in an attempt to make the cut. If riskier play tends to be accompanied by higher scores, this pattern should emerge. An alternative explanation is that many players in the last two sort groups may consider it a foregone conclusion that they will miss the cut and, perhaps, do not give their best efforts in the second of the two adjacent rounds. Although the two explanations are quite different, both involve players changing their normal method of play when the second of the two rounds is followed by a cut. We can test for this tendency by separating the average residual scores into those for which the second of the two adjacent rounds is immediately followed by a cut and those for which it is not. Although not shown in the table, for all sort groups except 19 and 20, there is little difference in performance when a cut occurs after the second of two qualifying adjacent rounds (panel 2) and when it does not (panel 3). However, for sort categories 19 and 20, the averages of the θ residuals in the second of the two adjacent rounds are 0.381 and 0.406, respectively, when a cut occurs immediately after the second of the two rounds, and -0.021 and 0.212, respectively, when a cut does not follow the second round.
Moreover, the regression of second-round residuals against first-round residuals is significant for sort category 20 only when the second of the two rounds is followed by a cut. Taken together, this evidence suggests that tournament participants who perform exceptionally poorly in the earliest round(s) may change their method of play in the round before a cut to reflect the high probability of being cut. Otherwise, the threat of being cut does not appear to affect player performance. It should be noted that if a player's exceptionally good (or bad) score, as measured by his θ residual in the first of two adjacent rounds, is due to an advantage (or disadvantage) in playing a particular course, the advantage should carry over into the next round, provided the next round is played on the same course. Since the large majority of adjacent rounds summarized in Table 5 involve play on the same course, the fact that the average second-round residual in each sort category is essentially zero provides additional evidence that player-course interactions do not play a significant role in determining 18-hole scores on the PGA Tour.

Table 5 also summarizes the results of identical tests of η residuals. Unless the AR(1) model does not adequately capture the correlation structure of player scores around their respective spline fits and/or there are significant player-course interaction effects, η residuals should represent white noise uncorrelated from one round to the next. Table 5 shows that regardless of the first-round sort category, average residual scores in the second of two adjacent rounds are very close to zero, and that there is no discernible relationship between the signs of average first-round scores and those of adjacent second-round scores. As with θ residuals, we run a least squares regression within each sort group in which the second residual in each pair is regressed against the first. Only one coefficient is significant at the 5 percent level: that for sort group 4. Moreover, the average second-round residual in group 4 is so close to zero (-0.029) that any tendency for the direction of abnormal performance to persist in this sort group is not accompanied by a level of performance that would make much difference to a player's 18-hole score. Therefore, we conclude that significant differences between actual and predicted scores are due, primarily, to luck, and that these differences cannot be expected to persist.

5.2.2 How Much Luck Does It Take to Win a PGA Tour Event? It is interesting to consider how much luck it takes to win a PGA Tour event or at least to finish among the top players. Table 6 summarizes the actual scores and θ residual scores on a round-by-round basis for the top 40 finishers in the 2001 Players Championship.
We focus on θ residual scores rather than η residuals so that we can determine the extent to which each player's scores deviate from their time-dependent mean values. The pattern of residual scores exhibited in this table is typical of those for all 182 tournaments. Among the top 40 finishers, the total residual score for almost all players is negative; only a few, among them Steve Flesch, have positive total residual scores. Thus, to have won this tournament, and almost all others in the sample, not only must one have played better than normal, but one must also have played sufficiently well (or with sufficient luck) to overcome the collective good luck of many other participants in the same event. Over all 182 tournaments, the average total θ residual score for winners and first-place ties was -9.92 strokes, with the total residual ranging from -2.14 strokes for Tiger Woods in the 2000 ATT Pebble Beach Pro-Am to -23.48 strokes for the winner of the 2001 Phoenix Open. (Similarly, the average η residual was -9.80 strokes.) Table 7 summarizes the highest 20 total θ

residual scores per tournament for winners and first-place ties. It is noteworthy that no player won a tournament after recording a positive total θ residual score. Although not shown in the table, Tiger Woods' total η residual score for the 2000 ATT Pebble Beach event was +0.32 strokes, but this is the only positive total residual using either the θ or η measure. It is also noteworthy that Woods' name appears 11 times in Table 7 and that all the other players on the list (Phil Mickelson, Sergio Garcia, David Duval, and Davis Love III) are among the world's most recognizable players. Thus, over the 1998-2001 period, only Tiger Woods, and perhaps a handful of other top players, were able to win tournaments without experiencing exceptionally good luck.

5.2.3 Standard Deviation of Residual Scores. Figure 4 summarizes the standard deviation of θ residual errors among all 253 players in the sample. The range of standard deviations is 2.14 to 3.45 strokes per round, with a median of 2.69 strokes. Phil Mickelson, well known for his aggressive play and propensity to take risks, has the 15th highest standard deviation; the player with the third highest standard deviation is likewise known for aggressive, risk-taking play. Ian Leggatt has the lowest standard deviation. Chris Riley, known as a very conservative player, has the second lowest deviation, and the player with the fifth lowest is also known for very conservative play. It is interesting to consider whether average scores and standard deviations of θ residual errors are correlated. A least squares regression of each player's mean spline-based estimate of skill over the entire sample period on the standard deviation of his residual errors yields expected score = 68.22 + 1.14 × standard deviation, with adjusted R² = 0.067, F = 19.02 and p-value = 0.00002. Thus, there is a tendency for greater variation in player scores to be associated with slightly higher average scores.

5.2.4 Effect of Round-Course Interactions. Figure 5 summarizes the distribution of 848 random round-course interaction coefficients estimated in connection with model 3.
The coefficients range in value from -3.924 to 6.946, implying almost an 11-stroke difference between the relative difficulty of the most difficult and easiest rounds played on the Tour during the 1998-2001 period. Over this period, 25 tournaments were played on more than one course. In multiple-course tournaments, players are (more or less) randomly assigned to a group that rotates among all courses used for the tournament. By the time each rotation is completed, all players will have played the same courses. At that time a cut is made, and participants who survive the cut finish the tournament on the same course. Although every attempt is made to set up the courses so that they play with approximately the same level of difficulty in each round, tournament officials cannot control the

weather, and, therefore, there is no guarantee that all rotation assignments will play with approximately the same level of difficulty. Figure 6 shows the distribution of the difference in the sum of round-course interaction coefficients between the easiest and most difficult rotations for the 25 tournaments that were played on more than one course. For seven of the 25 tournaments, the difference was less than 0.50 strokes; for these tournaments the difference was sufficiently small that a player's total score should have been about the same regardless of the course rotation to which he was assigned. At the extreme, there was a 5.45-stroke differential between the relative difficulties of the easiest and most difficult rotation assignments in the 1999 ATT Pebble Beach Pro-Am. Within this tournament, two of six possible rotations played with round-course interaction coefficients that totaled 10.41 and 10.23 strokes, while the totals for the remaining four rotations fell between 4.96 and 5.72 strokes. The two difficult rotations involved playing the famed Pebble Beach course on the third day of the tournament. That day was described as one of the nastiest since the tournament began in 1947 (www.golftoday.co.uk/tours/tours99/pebblebeach/round3report.html), and the adverse weather conditions had a much greater effect on scores recorded on the Pebble Beach course than on the other two courses, Spyglass and Poppy Hills. According to David Duval (www.golftoday.co.uk), "This is the stuff we stopped playing in last year. … It's the type of day you don't want, for the sole reason that luck becomes a big factor." It is interesting that the top nine finishers in this tournament all played one of the four easiest rotations, as did four of the five players who tied for 10th place. Among the top 20 finishers, only two played one of the two difficult rotations. Clearly, for this particular tournament Duval was right.
The luck of the draw had more to do with determining the top money winners than the actual skill exhibited by the players.

5.3 Hot and Cold Hands

Hot and cold hands represent the tendency for abnormally good and poor performance to persist over time and have been the focus of a number of statistical studies of sports. While some had argued for the presence of a hot hand in basketball, Gilovich, Vallone and Tversky (1985) argued that this belief might exist even though actual shooting was consistent with a random process of hits/misses (wins/losses). In other words, people may find systematic patterns in what is actually random data. Larkey, Smith, and Kadane (1989) disputed this finding. Wardrop (1995) notes that

the data used by the lay person to make such a determination may be substantially different from what is available to professional statisticians, and this may account (in part) for the different inferences. (In this respect, the attempts to reconcile the findings of cognitive psychologists and statisticians are reminiscent of the problem, well known to economists, that tests of rational expectations cannot be definitive because the econometrician cannot be certain to have the same data as economic agents and to use it in the same way.) Another important difficulty associated with early tests of hot hands is the high probability of confounding influences. For example, in basketball, the opposing team may decide to double-team any shooter who appears to have a hot hand, thereby changing the likelihood of shooting success. In an important contribution to the sports statistics literature, Albright (1993) extended the early literature to consider how to formulate and execute conditional tests for random outcomes in sporting events. His approach reduced the noise in sequence tests by accounting for context (situational) effects. Klaassen and Magnus (2001) show that the iid hypothesis is a good approximation for modeling the probability of winning a point in tennis after controlling for player quality, context (situational) variables, and a random component. While much of the hot hands literature is focused on basketball and baseball, there is some evidence on this issue for golf as well. Gilden and Wilson (1995) study hot/cold hands in putting and find that more subjects display streaks in putting success/failure than would be expected by chance alone (using runs tests on success vs. failure). Further, they find that putting streaks are more likely when the difficulty of the task is commensurate with the skills of the performer: streaks are much less likely when the task is very difficult relative to skill.
Clark has studied hot and cold hands in 18-hole golf scores (2003a, 2003b, and 2004a) and also in hole-by-hole scores (2004b). (An excellent summary of the psychological dimensions of hot and cold hands in golf is provided in Clark [2003a].) In general, Clark finds little evidence for hot and cold hands among players on the PGA Tour, Senior PGA (SPGA) Tour and Ladies PGA (LPGA) Tour. Throughout his studies, Clark defines a success as a score of par or better, either on an 18-hole or hole-by-hole basis. Although he finds some evidence for streaky play in 18-hole scores, he shows that this evidence is the result of par or better being a more frequent occurrence on the easiest golf courses. In a related set of papers (Clark [2002a and 2002b]), Clark finds no evidence of choking among players who are in contention going into the final rounds of regular PGA, SPGA and LPGA Tour events and the PGA Tour's annual qualifying tournament known as Q-School. In none

of these papers is player performance measured relative to a player's normal level of play or in relation to the relative difficulty of each round (or round-course interaction). Therefore, Clark's results and those which we report below are not directly comparable.

Compared with other competitive sports, measuring success and failure in golf can be done with greater precision. Unlike competitors in team sports, golfers do not have to face defenses or deal with the responses of teammates who may attempt to adapt to their seemingly hot or cold play. Although each stroke in golf represents a unique challenge, the strokes taken collectively over an entire 18-hole round represent a reasonably homogeneous challenge for most competitors in a golfing event. Moreover, except for changes in weather, intentional changes in specific course conditions and changes in venue, all of which we account for statistically through round-course interactions, the challenges associated with 18-hole golf rounds are reasonably homogeneous from one round to the next. And unlike a sport such as basketball, in which a player either makes or misses a basket, golf makes it much easier to measure varying degrees of success and failure. Clearly, a golfer who shoots a 63 when his statistically estimated score is 70.2 has experienced greater success than if he had shot a 70. Perhaps the golfer's 70 is comparable to making an easy layup that rolls around the basket before falling in, while the 63 is comparable to swishing an 18-footer after being closely guarded by two defenders. If one were testing for runs in basketball, it is likely that the layup and the 18-footer would both be recorded as successes, with no distinction made between the degrees of success in the two cases. On the other hand, in golf one can make a clear distinction between a 63 and a 70 and should take advantage of the distinction when testing for streaky play.
Thus, conventional tests of hot hands, which tend to treat success and failure in terms of binary outcomes, may not be appropriate for golf. We believe that the most compelling tests for hot hands in golf are our previously documented bootstrap tests of the significance of the first-order autocorrelation coefficients associated with the θ residuals estimated in connection with each of our spline fits. Note that in model 3 we estimate each player's time-varying mean score after accounting for the simultaneously estimated round-course interaction effects common to all players and the first-order autocorrelation in θ residuals around each individual fit. If a player's skill level changes over time, the spline fit should pick that up. In fact, our bootstrap tests show that 30 percent of the spline fits are significantly different from players' (round-course adjusted) mean scores. Our tests also show that 158 of 253 first-order autocorrelation coefficients are positive, and among these 158 coefficients, 32 are significant at the 5 percent level.


Thus, as measured by first-order autocorrelation, there is clearly a tendency for a statistically significant number of PGA Tour participants to experience streaky play above and beyond any changes that might also be taking place in their estimated mean skill levels. Our tests also show that six of 95 first-order autocorrelation coefficients are significantly negative at the 5 percent level. Thus, we have evidence of significant first-order dependence in residual golf scores for 38 of 253 golfers, or 15 percent of our sample of players. All 38 players, and the estimates of the first-order autocorrelation in their residual scores, are shown in Table 8. In addition to noting the frequency of positive first-order autocorrelation in θ residuals, we perform three direct tests for hot hands that have been used in the sports literature. The first of the three is a standard runs test. The second test, used by Albright (1993) to test hitting streaks in baseball, tests whether a string of successive residual golf scores follows a first-order Markov chain. The third test, also used by Albright, employs a logistic regression model to test for hot hands in residual golf scores. All three tests are conducted on both θ and η residual scores. Due to the degree of autocorrelation in the θ residuals, we expect to find confirming evidence of hot hands from the additional tests as applied to these residuals. However, we do not expect to find evidence of hot or cold hands when applying these tests to the η residuals (where the effects of first-order autocorrelation have been removed). Also, each of these tests is performed on residual scores transformed into a {1, 0} binary form representing golf scores that are either lower (success) or higher (failure) than predicted by model 3.
Although binary form is the only practical form for measuring success or failure in shooting basketballs or hitting baseballs, much of the information about the degree of success or failure in golf is lost when residual golf scores are transformed in this fashion. As a result, the power of the three tests applied to binary golf scores should be somewhat lower than that of our bootstrap tests for significant first-order autocorrelation in individual residual scores.

5.3.1 Runs Tests.

For both sets of residuals, as applied to each player, let $N_1$ denote the number of residual scores less than or equal to the expected residual score (given the residuals in the most recent run), $N_2$ the number greater than the expected residual score, and $R$ the total number of runs. Since the total of a player’s $G$ residual scores must sum to zero, the expected residual score, given the residuals of the most recent run, is given by:

$$\text{expected residual} = \frac{-\,\text{sum of residuals in most recent run}}{G - \text{number of residuals in most recent run}}.$$
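A minimal sketch of this run-counting rule (our illustration, not the authors’ code) classifies each residual against the moving expected-residual threshold and tallies $N_1$, $N_2$ and $R$; `runs_z` then applies the standard runs-test moments:

```python
import math

def count_adjusted_runs(residuals):
    """Count runs where each new residual is judged against the expected
    residual implied by the current run (not against zero).
    Returns (N1, N2, R): scores at/below expectation, above expectation,
    and the number of runs."""
    G = len(residuals)
    run_sum = residuals[0]            # sum of residuals in the current run
    run_len = 1                       # number of residuals in the current run
    run_high = residuals[0] > 0.0     # the first expected residual is zero
    n1 = 0 if run_high else 1         # N1: residuals <= expected residual
    n2 = 1 if run_high else 0         # N2: residuals >  expected residual
    n_runs = 1
    for r in residuals[1:]:
        expected = -run_sum / (G - run_len)   # moving threshold for this score
        high = r > expected
        if high:
            n2 += 1
        else:
            n1 += 1
        if high == run_high:          # same side of the threshold: run continues
            run_sum += r
            run_len += 1
        else:                         # run ends; r starts a new run
            n_runs += 1
            run_sum, run_len, run_high = r, 1, high
    return n1, n2, n_runs

def runs_z(n1, n2, r):
    """Z statistic built from the standard runs-test mean and variance."""
    n = n1 + n2
    e_runs = 2.0 * n1 * n2 / n + 1.0
    var_runs = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    return (r - e_runs) / math.sqrt(var_runs)
```

Fed the worked example that follows (an initial +2 residual, then +1.5 and −0.8, with 97 small negative residuals filling out G = 100), the function applies the thresholds −2/99 and −3.5/98 and counts two runs.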


For example, assume a player has recorded 100 scores. The expected value of the player’s initial residual score is zero. Suppose that his actual initial residual score is +2. This establishes the sign of the first run which, in this case, is positive. Knowing that the initial residual is +2 makes the expected value of any of the 99 remaining residual scores $-2/99 = -0.0202$. Therefore, if the player’s next residual score exceeds −0.0202, the residual would be higher than expected, and the run would continue. Assume that the player’s next residual score is +1.5. Then the initial run would continue through the player’s second score. At this point, the total of the residuals of the run is 3.5, which implies an expected residual for any of the remaining 98 scores of $-3.5/98 = -0.0357$. Thus, if the next residual exceeds −0.0357, the run would continue. Assume that the third residual is −0.8. Then the initial run would stop and a new run would begin. Given the −0.8 total residual of the new run, the next expected residual would be $+0.8/99 = +0.0081$. Therefore, if the next residual is less than +0.0081, the second run would continue; otherwise it would stop. We make the same adjustment to residual scores in the Markov chain test for hot hands described in the next section. Following Mendenhall, Scheaffer and Wackerly (1981, p. 601), the expected number of runs and the standard deviation of the number of runs are as follows:

$$E(\text{runs}) = \frac{2N_1N_2}{N_1+N_2} + 1 \quad\text{and}\quad \sigma(\text{runs}) = \sqrt{\frac{2N_1N_2\,(2N_1N_2 - N_1 - N_2)}{(N_1+N_2)^2\,(N_1+N_2-1)}}.$$

For $N_1 + N_2 > 20$, the test statistic $Z = \frac{R - E(\text{runs})}{\sigma(\text{runs})}$ is approximately normally distributed with a mean of zero and standard deviation of 1.0. Although a two-sided runs test is typically used to test for independence in a time series, our concern is not whether residual golf scores are independent, but rather whether players display tendencies to have fewer runs than would be expected from a purely random series. Thus, for each player we employ a one-sided test of whether $Z < 0$. When runs tests are applied to θ errors, 28 of 253 Z-statistics are significant at the 5 percent level. This is consistent with our finding, using the bootstrap tests, that 32 of the 253 sets of θ errors display significantly positive first-order autocorrelation. As shown in Table 8, 15 of the 28 players are also among the 32 for whom the bootstrap tests for positive autocorrelation are significant. On the other hand, for the η residuals, which should represent white noise if an AR(1) process properly

captures the correlation structures of the θ errors, the runs test is significant at the 5 percent level for only six of 253 players. These results corroborate those of the bootstrap tests, suggesting that more players than can reasonably be expected by chance experienced hot/cold hands over our sample period. When the same tests are applied to η residuals, we find no additional evidence of hot and cold hands beyond the general tendencies captured by individual player cubic spline functions and their associated first-order autocorrelation parameters.

5.3.2 Markov Chain Test for Non-Random Residual Scores.

Albright (1993a) studies hitting streaks in baseball using several methods. One test focuses on whether a string of successive at-bats forms a first-order Markov chain. The test statistic is given by

$$\chi^2 = M\left[M_{0,0}M_{1,1} - M_{1,0}M_{0,1}\right]^2 / \left(M_0^2\,M_1^2\right), \qquad (4)$$

where $M_{i,j}$ is the number of times that state $i$ is followed by state $j$ and $M_i$ is the number of times the $i$th state occurs. Under the null hypothesis of randomness, this statistic has a chi-squared distribution with one degree of freedom. In our application, the two states represent better- and worse-than-average performance, as estimated by negative and positive θ and η residuals from model 3. We apply this method to the 253 golfers in our sample for both θ and η errors. When we use the θ residuals from the spline, 22 golfers (8.7 percent) have chi-square test statistics that reject the null hypothesis of no dependence at the 5 percent level. This two-sided test does not distinguish between positive and negative statistical dependence. However, 17 of the 22 golfers are among the 28 for whom the runs tests as applied to θ residuals are significant, and as shown in Table 8, 11 of the 22 are among those whom the bootstrap tests identify as having significantly positive autocorrelated residuals. The remaining five players have the five most positive Z values in connection with the runs test and would be significant at the 5 percent level if the direction of the test were reversed. Also, as shown in Table 8, three of the five players are among the six identified by the bootstrap tests as having significantly negative autocorrelated residual scores. For the η residuals, we find that only four golfers show evidence of non-random residual scores.

5.3.3 Logistic Model Test for Hot Hands.

Albright also considers the possibility that ‘situational’ variables might affect the onset (and perhaps persistence) of streaky play. In this setting, a logistic regression model may be used to isolate evidence of streaky play:


$$\ln\left[\frac{P_n}{1-P_n}\right] = \alpha + \beta H_n + \sum_{k=1}^{K} \lambda_k Z_{k,n}, \qquad (5)$$

where $P_n = P(X_n = 1)$ and $X_n = 1$ indicates a better-than-normal golf round, $\ln[P_n/(1-P_n)]$ is the log odds of a better-than-normal golf round (i.e., a negative residual from model 3), $H_n$ is a history-of-success variable, and $Z_{k,n}$ represents a sequence of $K$ ‘situational’ variables for the $n$th golfer that affect the odds of success (proximity to the leader, etc.). If the β term is positive, a history of recent success raises the odds of success, an indication of streaky play. In their comment on the Albright paper, Stern and Morris (1993) argue that the logistic regression approach has low power and is plagued by a negative bias in the estimate of the slope coefficient on the history variable. Albright takes issue with a portion of their comment in his rejoinder. In his application to hitting streaks in baseball, Albright defined the history-of-success variable in a number of ways. In our setting, we measure recent history as the cumulated θ or η residuals from model 3 for the past 16 rounds (roughly four tournaments, or about a month of play). We also experimented with four-, eight-, and 12-round cumulative performance, but the 16-round horizon produced the strongest case favoring streaky play. (Consistent with the previous two tests, we define success as a residual score that is less than $-(\text{sum of 16 previous residuals})/(G - 16)$.) As ‘situational’ variables, we use the tournament week number, the tournament round (first, second, etc.), the number of strokes a golfer is behind the leader (which captures whether a golfer is in contention), and the interaction of the tournament round and distance from the leader (since being close to the leader at the beginning of play on the last day is more important than early in a tournament). We estimated (5) for all 253 golfers in our sample.
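Equation (5) can be fit for each golfer with any standard logistic regression routine. The sketch below (our illustration with simulated data, not the paper’s estimation; the history variable `H` and the single situational variable `Z` are stand-ins) fits the model by Newton-Raphson:

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Newton-Raphson maximum likelihood for ln[P/(1-P)] = X @ beta.
    X must include a column of ones for the intercept (alpha)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        grad = X.T @ (y - p)                            # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X     # information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated golfer with a genuinely "hot" hand: true history coefficient 0.8
rng = np.random.default_rng(0)
n = 400
H = rng.normal(size=n)        # stand-in: cumulated residuals over prior 16 rounds
Z = rng.normal(size=n)        # stand-in: one situational variable
p_true = 1.0 / (1.0 + np.exp(-(-0.2 + 0.8 * H + 0.3 * Z)))
y = (rng.random(n) < p_true).astype(float)
beta_hat = fit_logit(np.column_stack([np.ones(n), H, Z]), y)
```

With this simulated data the fit recovers a clearly positive history coefficient; in the paper’s setting, the design matrix would instead hold the constant, $H_n$, and the four situational variables for a given golfer.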
Using θ residuals, we find that the history variable has a statistically significant impact on the probability of better-than-normal performance for 15 golfers in our sample using a 5 percent test. Among this group, the average marginal effect of history on this probability is 1.03 percent with a standard deviation of 1.37 percent. By contrast, the average marginal effect for the other 238 golfers is 0.143 percent with a standard deviation of 0.424 percent. Among the 15 golfers, the estimated coefficient for the history variable is positive for 11 players and negative for four. Thus, logistic regression detects hot hands for 11 of 253, or 4.3 percent, of the players, roughly the number that would be expected by chance. Table 8 shows that, unlike the runs and Markov chain tests, there

is very little overlap among the players identified by the bootstrap and logistic regression tests as showing evidence of streaky play. Using η residuals, we find that the history variable had a statistically significant impact on the probability of better-than-normal performance for 19 golfers in our sample. The coefficient on the history variable is negative for four golfers, the same four as in the tests using θ residuals. Moreover, there is substantial overlap among the players with positive history coefficients in the two sets of tests. The results using η residuals are somewhat puzzling, since there should be less evidence of statistical dependence in a series of residuals after the effects of first-order autocorrelation have been removed. Further analysis of the 19 golfers finds that they fall into several categories. First, among the 19 we find a substantial group of journeymen players, characterized by a higher tendency to miss the cut, a smaller number of rounds played, generally lower finishes, and correspondingly lower Tour earnings. Second, we find a group of European players who play occasionally on the PGA Tour. This group also had a relatively small number of rounds played over the sample period, typically with substantial gaps between tournaments played in the U.S. The final group of four players (J.L. Lewis, , Jeff Sluman and Glen Day) represents fairly accomplished, regular PGA Tour players. The logistic regression tests identify an almost entirely different group of players than the runs and Markov chain tests. This apparent conflict may reflect the sensitivity of spline model estimates to the nature of the underlying sample. Eleven of the 19 players for whom the history variable from the logistic regression is significant played fewer than 200 rounds. A careful examination of their playing patterns indicates sporadic participation on the Tour.
For some players, such as European Tour players , , and Ian Leggatt, and older players , Andy Bean and Lanny Wadkins, sporadic play on the PGA Tour is likely the result of choice. For the remaining three, , Bart Bryant and Eric Booker, the relatively small number of recorded scores results from missing cuts and not being eligible to play in a large number of PGA Tour events. And to a lesser extent, the same can be said for , Jimmy Green, Chris Smith and who recorded between 240 and 268 scores. Wang (1998) has shown that approximately 100 or more observations are required to obtain dependable statistical estimates using the cubic spline model. We had this in mind when we selected golfers to include in the sample, but temporal censoring for some golfers may have nontrivial effects. In the context of estimating a spline model, the non-observability of scores may cause misplacement

of join points and a sequence of positive or negative residuals. Dropping these players from the sample would limit our ability to estimate course-related effects, so it is not obvious that this step is optimal. Careful examination of this issue and the associated sampling tradeoff is beyond the scope of this paper but may constitute an interesting question for further research. In summary, we find that 158 of 253 first-order autocorrelation coefficients associated with θ residual scores are positive, and among these 158 coefficients, 32 are significant at the 5 percent level. Thus, there is clearly a tendency for a statistically significant number of PGA Tour participants to experience streaky play. We find confirming evidence of hot hands using a conventional runs test and a Markov chain test. We also report results of tests for hot hands using a logistic regression model. These tests provide mixed signals about the degree of statistical dependence in both sets of residuals. The sporadic nature of play by many of the Tour players identified by these tests as having statistically-dependent residual scores may compound the low power problem noted by Stern and Morris (1993).
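For reference, the Markov chain statistic of equation (4) reduces to a short computation on the 2×2 transition counts of the binary better/worse-than-predicted series. A sketch (our illustration, not the authors’ code):

```python
def markov_chain_stat(binary):
    """Chi-square statistic (1 df) for first-order dependence in a 0/1
    sequence: M * (M00*M11 - M10*M01)^2 / (M0^2 * M1^2), where M is the
    number of transitions, M_ij counts transitions from state i to j, and
    M_i counts occurrences of state i as a transition source."""
    counts = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for a, b in zip(binary[:-1], binary[1:]):
        counts[(a, b)] += 1
    M = len(binary) - 1
    M0 = counts[(0, 0)] + counts[(0, 1)]
    M1 = counts[(1, 0)] + counts[(1, 1)]
    num = (counts[(0, 0)] * counts[(1, 1)] - counts[(1, 0)] * counts[(0, 1)]) ** 2
    return M * num / (M0 ** 2 * M1 ** 2)
```

A perfectly alternating sequence (negative dependence) and a perfectly streaky one (positive dependence) both produce large statistics, consistent with the two-sided nature of the test noted above.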

6. Summary and Conclusions

In this paper we employ a generalized additive model to estimate cubic spline-based time-dependent mean skill functions and the first-order autocorrelation of residual scores about the mean (using the model of Wang [1998]) for 253 active PGA Tour golfers over the 1998 – 2001 period, while simultaneously adjusting each score for the relative difficulty of each round-course interaction. Estimates of player-specific mean skill functions provide strong evidence that the skill levels of PGA Tour players change through time. For many players, the relationship between the player’s average skill level and time is well approximated by a linear time trend. For others, the relationship between mean player skill and time is more complex and cannot easily be modeled by a simple parametric time relationship. Using this methodology, we are able to rank golfers at the end of 2001 on the basis of their estimated mean skill levels and also to provide point estimates of their end-of-2001 scores, taking into account both the mean functions and the simultaneously estimated first-order autocorrelations in residual scores. Although there are some exceptions, our ranking of players at the end of 2001 is generally consistent with the Official World Golf Ranking. Using the model, we are also able to identify the players who experienced the most improvement and deterioration in their skill levels over the 1998-2001 sample period. Generally, the players who experienced the most significant deterioration in skill were the oldest players on the Tour.


We also address the role of luck, defined as the residual score after adjusting for the effects of first-order autocorrelation, in determining scoring outcomes on the PGA Tour. We find that significant good or bad luck in one round does not carry over into the next round. Generally, those who perform exceptionally well or poorly in a given round tend to perform at or near their norm in the next round. In addition, we find that even for the most highly skilled players, including Tiger Woods, skill alone is insufficient to win a golf tournament; a little luck is required for the most highly skilled players to win, and lots of luck is required for more average players to win. We also find that the particular course rotations to which players are assigned can have a significant effect on the outcomes of tournaments played on more than one course. For seven of 25 such tournaments, the rotations made a difference of less than 0.50 strokes per round, a difference sufficiently small to leave the score of a typical player unaltered. At the extreme, however, there was a 5.45-stroke differential between the relative difficulties of the easiest and most difficult course rotation assignments. In our sample, the range of standard deviations in residual golf scores is 2.14 to 3.45 strokes per round, with a median of 2.69 strokes. As one might expect, John Daly and Phil Mickelson, both known for their aggressive play and propensity to take risks, are among the players with the highest standard deviations. We find that there is a tendency for higher standard deviations to be associated with slightly higher average scores. The possible existence of hot and cold hands, or streaky play, has been studied extensively in the statistical sports literature. In a series of papers, Clark has concluded that there is little evidence of streaky play in golf.
In contrast, our results show that residual errors about each player’s cubic spline-based mean skill function are positively autocorrelated for 158 of 253 golfers, and among the 158, the autocorrelation is significant at the 5 percent level for 32 of the players. Thus, as measured by first-order autocorrelation, there is clearly a tendency for a statistically significant number of PGA Tour participants to experience streaky play. This evidence is confirmed by the results of the runs and Markov chain tests used elsewhere in the sports literature.


Appendix
Estimation of Model 3

Model 3 is a generalized additive model that can be estimated using backfitting, as described in Hastie and Tibshirani (1990). With backfitting, each coefficient (or smoothing function) is estimated separately using residuals computed conditional on the remaining coefficients (or smoothing functions). This process is continued iteratively until the coefficients (or functions) no longer change. The procedure for estimating model 3 is designed to obtain an accurate iterative solution within a reasonable amount of time. Before describing the procedure, it should be noted that the estimation of one set of spline fits for each of the 253 players in the sample takes approximately two hours using a 3.19 GHz Windows-based PC. However, if the same splines are fit assuming that the first-order autocorrelation coefficients, $\phi_i$, are zero for each of the 253 players, a single set of 253 splines takes approximately one minute to estimate. Therefore, to the extent possible, we attempt to perform the bulk of our iterative backfitting calculations assuming $\phi_1 = \phi_2 = \cdots = \phi_n = 0$.
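The backfitting idea itself can be illustrated with a stripped-down sketch (our illustration: simple least-squares slopes stand in for the paper’s splines and random effects). Each component is refit to the partial residuals left by the other until the fits stop changing:

```python
import numpy as np

def backfit(y, x1, x2, iters=100):
    """Backfitting for y ~ b1*x1 + b2*x2: alternately refit each univariate
    component to the partial residuals from the other component."""
    b1 = b2 = 0.0
    for _ in range(iters):
        b1 = x1 @ (y - b2 * x2) / (x1 @ x1)   # refit component 1
        b2 = x2 @ (y - b1 * x1) / (x2 @ x2)   # refit component 2
    return b1, b2

# Correlated regressors and a noiseless target give a clean convergence check
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + rng.normal(size=200)
y = 2.0 * x1 - 1.0 * x2
b1, b2 = backfit(y, x1, x2)
```

Because the regressors are correlated, neither single update is correct on its own, but the alternation converges to the joint least-squares solution, which is the same mechanism the three-phase procedure below exploits.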

Denote the vector of round-course interaction coefficients as $\omega = (\omega_1, \omega_2, \ldots, \omega_m)'$. Let $\omega^0 = \omega(f_1^0, f_2^0, \ldots, f_n^0)$ denote the iterative solution value for $\omega$ when $\phi_1 = \phi_2 = \cdots = \phi_n = 0$, expressed in terms of the corresponding spline solution values $f_1^0, f_2^0, \ldots, f_n^0$. Similarly, let $\omega^{\phi_i} = \omega(f_1^0, f_2^0, \ldots, f_i^{\phi_i}, f_{i+1}^0, \ldots, f_n^0)$ denote the solution value for $\omega$ in terms of $f_1^0, f_2^0, \ldots, f_i^{\phi_i}, f_{i+1}^0, \ldots, f_n^0$, where $f_i^{\phi_i}$ denotes the spline fit of golf scores for player $i$, adjusted for round-course interactions $\omega^0$, while allowing the first-order autocorrelation parameter $\phi_i$ to be determined freely with all other values of $\phi$ set to zero. Finally, let $\omega^*$ denote the final iterative solution value for $\omega$ in terms of each final spline function, $f_i^*$, and its associated first-order autocorrelation parameter, $\phi_i^*$. The mathematical basis for our solution procedure draws on the following:

$$d\omega = \frac{\partial\omega}{\partial\phi_1}\,d\phi_1 + \frac{\partial\omega}{\partial\phi_2}\,d\phi_2 + \cdots + \frac{\partial\omega}{\partial\phi_n}\,d\phi_n \approx \frac{\partial\omega}{\partial\phi_1}\,\Delta\phi_1 + \frac{\partial\omega}{\partial\phi_2}\,\Delta\phi_2 + \cdots + \frac{\partial\omega}{\partial\phi_n}\,\Delta\phi_n. \qquad (A1)$$


Equation A1 should hold, provided there is no functional relationship among the various correlation parameters. Given the nature of the game of golf, one should not expect the autocorrelation parameter for one player to be functionally related to those of other players.

Assume that $\omega^0$ and the associated solution values $f_1^0, f_2^0, \ldots, f_n^0$ are known. Then, from equation A1,

$$\Delta\omega = \omega^* - \omega^0 \approx \frac{\partial\omega}{\partial\phi_1}\,\Delta\phi_1 + \frac{\partial\omega}{\partial\phi_2}\,\Delta\phi_2 + \cdots + \frac{\partial\omega}{\partial\phi_n}\,\Delta\phi_n. \qquad (A2)$$

Note that $\frac{\partial\omega}{\partial\phi_i}\,\Delta\phi_i \approx \omega^{\phi_i} - \omega^0$. Substituting for $\frac{\partial\omega}{\partial\phi_i}\,\Delta\phi_i$ in equation A2 and re-arranging yields:

$$\omega^* \approx \omega^0 + \left(\omega^{\phi_1} - \omega^0\right) + \left(\omega^{\phi_2} - \omega^0\right) + \cdots + \left(\omega^{\phi_n} - \omega^0\right) = \sum_{i=1}^{n}\omega^{\phi_i} - (n-1)\,\omega^0. \qquad (A3)$$

The right-hand side of equation A3 becomes our first estimate of the final solution vector of random round-course interaction coefficients, $\omega^*$. As it turns out, very little additional backfitting is necessary beyond the initial estimate from A3. The following three-phase backfitting process is employed to obtain our final solution estimates. Throughout, all cubic splines are estimated in R using REML in connection with the ssr function in the assist package of Wang and Ke (2004). Random round-course interaction coefficients are estimated in R using REML as implemented in the lme function in the new lme4 package of Pinheiro and Bates by first estimating a model of the following form, $spline\_adj_{i,j} = \sum_{c=1}^{m}\omega_c' r_c + \kappa_{i,j}$, and then computing $\omega_c = \omega_c' - \omega_1'$ for $c = 1, \ldots, m$.

Phase 1

The purpose of phase 1 is to estimate solution values for the round-course interaction vector

$\omega^0$ and the associated cubic spline-based functions of mean player skill, $f_1^0, f_2^0, \ldots, f_n^0$, under the assumption that $\phi_1 = \phi_2 = \cdots = \phi_n = 0$.

1. For each player, $i$, and round-course interaction, $j$, compute $spline\_adj_{i,j} = s_{i,j} - f_i(g_i[j])$, which represents the score of player $i$ associated with round-course interaction $j$ after adjusting for the spline-based estimate of the player’s time-dependent mean skill level. For the first iteration, assume $f_i(g_i[j]) = 0$; otherwise use the values of $f_i(g_i[j])$ as estimated in step 4 below.


2. Using the full set of spline-adjusted scores computed in step 1, estimate each round-course interaction coefficient, $\omega_j$, from the following random effects model: $spline\_adj_{i,j} = \sum_{c=2}^{m}\omega_c r_c + \kappa_{i,j}$. Stop if convergence has occurred.

3. For each player, $i$, and round-course interaction coefficient, $j$, as estimated in step 2, compute $r\_adj_{i,j} = s_{i,j} - \omega_j$, which represents the score of player $i$ associated with round-course interaction $j$ after adjusting for the effects of round-course interaction $j$.

4. Estimate the cubic spline-based skill level for each player, $i$, in terms of $r\_adj_{i,j}$ as computed in step 3, such that $r\_adj_{i,j} = f_i(g_i[j]) + \theta_i(g_i[j])$.

5. Return to step 1.
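The alternating structure of phase 1 can be shown with a deliberately simplified sketch (our illustration): player “splines” collapse to constant means, every player appears in every round-course, and the coefficients come from simple averages rather than REML. Only the alternation between steps 2 and 4 is the point:

```python
import numpy as np

def phase1_toy(S, iters=50):
    """Toy phase 1: S[i, j] is the score of player i in round-course j.
    Alternate between round-course effects (step 2) and player skill
    levels (step 4), with all phi_i held at zero."""
    n_players, m = S.shape
    f = np.zeros(n_players)                    # step 1: start with f_i = 0
    omega = np.zeros(m)
    for _ in range(iters):
        omega = (S - f[:, None]).mean(axis=0)  # step 2: round-course effects
        omega -= omega[0]                      # normalise against interaction 1
        f = (S - omega[None, :]).mean(axis=1)  # steps 3-4: player skill
    return f, omega

# Synthetic additive data: hypothetical skills and course difficulties
rng = np.random.default_rng(2)
skill = rng.normal(70.0, 1.5, size=6)
difficulty = rng.normal(0.0, 1.0, size=8)
S = skill[:, None] + difficulty[None, :]
f_hat, omega_hat = phase1_toy(S)
```

With balanced, purely additive data the alternation recovers the decomposition exactly (up to the normalization against the first round-course); with the paper’s unbalanced data and spline-based skill functions, the same loop simply takes more iterations to settle.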

Phase 2

In phase 2 we compute the initial estimate of $\omega^*$ using the methodology defined by equation A3.

1. For each player, $i$, compute $r\_adj_{i,j} = s_{i,j} - \omega_j^0$, which represents the score of player $i$ after adjusting for the effects of round-course interaction $j$ as estimated in the solution vector $\omega^0$ from phase 1.

2. Estimate the cubic spline-based skill function for each player, $i$, in terms of the full set of $r\_adj_{i,j}$ values computed in step 1, such that $r\_adj_{i,j} = f_i(g_i[j]) + \theta_i(g_i[j])$ with $\theta_i(g_i[j]) = \phi_i\,\theta_i(g_i[j-1]) + \eta_i(g_i[j])$. In this step, $\phi_i$ is not set to zero but is estimated jointly with $f_i(g_i[j])$.

3. Given the results from step 2, estimate $\omega^{\phi_i} = \omega(f_1^0, f_2^0, \ldots, f_i^{\phi_i}, \ldots, f_n^0)$ for each player $i$.

4. Given the estimated solution value for $\omega^0$ from phase 1 and the results from step 3, compute the estimated value of $\omega^*$ using equation A3.
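The quality of the A3 combination can be checked numerically. In this toy sketch (a hypothetical smooth mapping from the autocorrelation parameters to a scalar “solution,” not the paper’s estimator), perturbing one $\phi$ at a time and combining the results as in A3 recovers the jointly perturbed solution whenever the mapping has no interaction terms among the $\phi$’s, which is the condition discussed after equation A1:

```python
import numpy as np

def omega_of_phi(phi):
    """Hypothetical smooth, additively separable mapping from the vector of
    autocorrelation parameters to a scalar solution value."""
    phi = np.asarray(phi, dtype=float)
    return 1.0 + 0.3 * phi.sum() + 0.05 * (phi ** 2).sum()

phi_star = np.array([0.10, -0.05, 0.20])
n = len(phi_star)
omega0 = omega_of_phi(np.zeros(n))              # all phi set to zero
one_at_a_time = [omega_of_phi(np.eye(n)[i] * phi_star) for i in range(n)]
approx = sum(one_at_a_time) - (n - 1) * omega0  # right-hand side of A3
exact = omega_of_phi(phi_star)                  # jointly perturbed solution
```

For this separable mapping the A3 combination is exact; with cross-dependence among the $\phi$’s it would only be a first-order approximation, which is why phase 3 adds a final round of backfitting.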

Phase 3

In phase 3 we employ backfitting to obtain greater precision in the estimation of $\omega^*$.

1. For each player, $i$, and round-course interaction, $j$, compute $r\_adj_{i,j} = s_{i,j} - \omega_j$ using the most recently estimated value of $\omega_j$.


2. Estimate the cubic spline-based skill function for each player, $i$, in terms of the full set of $r\_adj_{i,j}$ values as computed in step 1, such that $r\_adj_{i,j} = f_i(g_i[j]) + \theta_i(g_i[j])$ with $\theta_i(g_i[j]) = \phi_i\,\theta_i(g_i[j-1]) + \eta_i(g_i[j])$.

3. For each player, $i$, and round-course interaction, $j$, compute $spline\_adj_{i,j} = s_{i,j} - \left(f_i(g_i[j]) + \phi_i\,\theta_i(g_i[j-1])\right)$, which represents the score of player $i$ associated with round-course interaction $j$ after adjusting for the spline-based estimate of the player’s time-dependent mean skill level and the predicted residual error from the spline fit as estimated in step 2.

4. Using REML, estimate the effects of each round-course interaction, $\omega_j$, from the following random effects model using the full set of spline-adjusted scores from step 3: $spline\_adj_{i,j} = \sum_{c=2}^{m}\omega_c r_c + \kappa_{i,j}$. Stop if convergence has occurred. Otherwise, return to step 1.

We employ 50 iterations in the phase 1 portion of the estimation procedure. After the 50th iteration, the maximum change from the previous iteration in the absolute values of the 848 round-course interaction coefficients is 0.00201, and the sum of the 848 squared changes in the same coefficients is 0.00110. Starting with the 14th iteration, both of these values decrease monotonically, from starting values of 0.0091 and 0.01785, respectively.

In phase 2, we compute the initial estimate of ω∗ as defined in equation A3. The solid line in figure A1 shows the amount of change in round-course interaction coefficients between the estimates from phase 2 and the final iterative solution values from phase 1. The maximum absolute change among the 848 coefficients during this phase is 0.13763. After computing the initial estimate of ω∗ , we employ two iterations in phase 3 to obtain our final model solution estimates. In the second of the two iterations, the maximum change from the previous iteration in the absolute values of the 848 round-course interaction coefficients is 0.00671, and the sum of 848 squared changes in the same coefficients is 0.00752. The scatter plots in figure A1 show the amount of change in the same coefficients represented by the solid line during phase 3.

Changes in Round-Course Interaction Coefficients in Phases 2 and 3 of the Estimation of Model 3

[Figure A1: horizontal axis, the 848 coefficients ordered by their change in phase 2 (0 to 800); vertical axis, change in coefficient (-0.15 to 0.15); series, phase 2 (solid line) and phase 3 (scatter).]

Figure A1. Change in round-course interaction coefficients in phases 2 and 3 of the estimation of model 3.

Coefficient changes in phase 2 are ordered from lowest to highest. Phase 3 changes are shown for the same coefficient ordering.

References

Albright, S. C. (1993a), “A Statistical Analysis of Hitting Streaks in Baseball,” Journal of the American Statistical Association, 88, 1175 – 1183.

______(1993b), “Rejoinder,” Journal of the American Statistical Association, 88, 1194 – 1196.

Berry, S. M. (2001), “How Ferocious is Tiger?,” Chance, 51 – 56.

Clark III, R. D. (2002a), “Do Professional Golfers ‘Choke’?,” Perceptual and Motor Skills, 94, 1124 – 1130.

______(2002b), “Evaluating the Phenomenon of Choking in Professional Golfers,” Perceptual and Motor Skills, 95, 1287 – 1294.

______(2003a), “Streakiness Among Professional Golfers: Fact or Fiction?,” International Journal of Sports Psychology, 34, 63 – 79.

______(2003b), “An Analysis of Streaky Performance on the LPGA Tour,” Perceptual and Motor Skills, 97, 365 – 370.

______(2004a), “On the Independence of Golf Scores for Professional Golfers,” Perceptual and Motor Skills, 98, 675 – 681.

______(2004b), “Do Professional Golfers Streak? A Hole-to-Hole Analysis,” ASA Section on Statistics in Sports, 3207 – 3214.

Efron, B. and Tibshirani, R. J. (1998), An Introduction to the Bootstrap (Boca Raton, FL: Chapman and Hall/CRC).

Gilden, D. and Wilson, S. (1995), “Streaks in Skilled Performance,” Psychonomic Bulletin & Review, 2, 260 – 265.

Gilovich, T., Vallone, R., and Tversky, A. (1985), “The Hot Hand in Basketball: On the Misperception of Random Sequences,” Cognitive Psychology, 17, 295 – 314.

Hastie, T. J. and Tibshirani, R. J. (1990), Generalized Additive Models (London, UK: Chapman and Hall).

Jenkins, S. (2005), “It’s Best to be Lucky and Good,” Washington Post, April 12, D01.

Klaassen, F. and Magnus, J. (2001), “Are Points in Tennis Independent and Identically Distributed? Evidence From a Dynamic Binary Panel Data Model,” Journal of the American Statistical Association, 96, 500 – 509.

Larkey, P., Smith, R., and Kadane, J. (1989), “It’s Okay to Believe in the ‘Hot Hand’,” Chance, 2, 35 – 37.

Ljung, G. M. (1986), “Diagnostic Testing of Univariate Time Series Models,” Biometrika, 73, 725 – 730.

______ and Box, G. E. P. (1978), “On a Measure of Lack of Fit in Time Series Models,” Biometrika, 65, 297 – 303.

Mendenhall, W., Scheaffer, R. L. and Wackerly, D. D. (1981), Mathematical Statistics with Applications (Boston, MA: Duxbury Press).

Pinheiro, J. C. and Bates, D. M. (2000), Mixed Effects Models in S and S-Plus (New York, NY: Springer).

Stern, H. S. and Morris, C. N. (1993), “Comment,” Journal of the American Statistical Association, 88, 1189 – 1193.

Wardrop, R. L. (1995), “Simpson’s Paradox and the Hot Hand in Basketball,” The American Statistician, 49, 24 – 28.

Wang, Y. (1998), “Smoothing Spline Models With Correlated Random Errors,” Journal of the American Statistical Association, 93, 341 – 348.

Wang, Y. and Ke, C. (2004), “ASSIST: A Suite of S Functions Implementing Spline Smoothing Techniques,” May 18.

Wang, Y., and Wahba, G. (1995), “Bootstrap Confidence Intervals for Smoothing Spline Estimates and Their Comparison to Bayesian Confidence Intervals,” Journal of Statistical Computation and Simulation, 51, 263 – 279.

Table 1
Model Identification

Panel A: Full sample (810 players; 75,054 observations)
                                       Model 2 using ...
                                       Round-Course   Actual
                             Model 1   Interactions   Rounds   Tournaments   Courses
Model                                      2.1          2.2        2.3         2.4
First-year skill dummies        810        810          810        810         810
Second-year skill dummies                  259          259        259         259
Third-year skill dummies                   197          197        197         197
Fourth-year skill dummies                  128          128        128         128
Round-course dummies            848        848
Round dummies                                           728
Tournament dummies                                                 182
Course dummies                                                                  77
Total dummy variables         1,658      2,242        2,122      1,576       1,471
Adj R2                       0.2996     0.3077       0.3023     0.2639      0.2287
F-test                        20.39      15.89        16.34      16.74       16.15
F-tests of model differences: 2.1 vs 1: 2.44; 2.1 vs 2.2: 5.66; 2.1 vs 2.3: 9.66; 2.1 vs 2.4: 10.67

Panel B: Players with more than 90 rounds (253 players; 64,364 observations)
                                       Model 2 using ...
                                       Round-Course   Actual
                             Model 1   Interactions   Rounds   Tournaments   Courses
Model                                      2.1          2.2        2.3         2.4
First-year skill dummies        253        253          253        253         253
Second-year skill dummies                  245          245        245         245
Third-year skill dummies                   196          196        196         196
Fourth-year skill dummies                  128          128        128         128
Round-course dummies            848        848
Round dummies                                           728
Tournament dummies                                                 182
Course dummies                                                                  77
Total dummy variables         1,101      1,670        1,550      1,004         899
Adj R2                       0.2682     0.2783       0.2726     0.2144      0.1931
F-test                        22.46      15.88        16.58      18.53       18.18
F-tests of model differences: 2.1 vs 1: 2.57; 2.1 vs 2.2: 5.14; 2.1 vs 2.3: 8.67; 2.1 vs 2.4: 9.59

Table 2
Spline Model vs. Four Alternative Specifications of Mean Player Skill

                                                      Mean per       Linear       Quadratic
                                             Mean     Calendar Year  Time Trend   Time Trend
Proportion significant at 5 percent level   0.2964       0.0474        0.0791       0.0474
Proportion of players for which alternative
model is superior to spline model           0.0751       0.0474        0.3004       0.1670

NOTE: The mean per calendar year is based on the same approximate calendar-year specification as in model 2. The proportion of players for which the alternative model is superior to the spline model is the proportion for which ζ̂ is positive for less than 20 of 40 bootstrap samples per player.

Table 3
Best 20 Players at the End of the 2001 Season Based on Spline Estimates of 18-Hole Scores Adjusted for Round-Course Interactions

Columns: (1) spline-based estimate of score adjusted for round-course interactions as of 11/04/01, not adjusted for autocorrelation in residuals; (2) the same estimate adjusted for autocorrelation in residuals; (3) Official World Golf Ranking as of 11/04/01.

                             (1)      (2)    (3)
 1. TIGER WOODS            67.57    67.93      1
 2. BOB ESTES              67.87    67.85     18
 3. PHIL MICKELSON         68.46    68.40      2
 4. SERGIO GARCIA          68.47    68.28      6
 5. VIJAY SINGH            68.85    68.90      8
 6.                        68.92    69.04     14
 7. DAVIS LOVE III         69.04    69.10      5
 8.                        69.09    69.09     13
 9. CHARLES HOWELL III     69.17    69.18     45
10. CHRIS RILEY            69.19    69.32     96
11. SCOTT MCCARRON         69.21    69.04     47
12. CHRIS DIMARCO          69.23    69.27     19
13.                        69.24    69.28     23
14.                        69.28    69.10      7
15.                        69.33    69.35     32
16. DAVID DUVAL            69.34    69.08      3
17. DARREN CLARKE          69.35    69.14     10
18.                        69.42    69.43      4
19.                        69.49    69.63     34
20.                        69.49    69.66     20

Table 4
Players Showing the Most Improvement and Deterioration in Skill Over the 1998-2001 Sample Period

Improving Skill                      Deteriorating Skill
Player             Skill Change      Player             Skill Change
CAMERON BECKMAN       -3.36          DAVID EDWARDS          2.32
BRIAN GAY             -3.15          STEVE JURGENSEN        2.60
J.J. HENRY            -2.87          TOM WATSON             2.64
                      -2.79          CRAIG STADLER          2.65
CHRIS DIMARCO         -2.79          CRAIG A. SPENCE        2.70
MIKE WEIR             -2.63          BOBBY WADKINS          2.74
BRINY BAIRD           -2.55          KEITH FERGUS           2.84
DAVID PEOPLES         -2.44          FUZZY ZOELLER          2.99
DARREN CLARKE         -2.39          LANNY WADKINS          8.18
BERNHARD LANGER       -2.36          BILLY RAY BROWN        8.95

Note: Skill change is the difference in the spline-based estimate of the player’s mean skill level from the beginning to the end of the player’s golf time. Golf time is specific to each player and does not necessarily begin at the start of 1998 and end at the end of 2001.

Table 5
Analysis of Residual Scores in Adjacent Rounds Not Divided by a Cut

              ------------------ θ Residuals ------------------    ------------------ η Residuals ------------------
                 Average            Regression of round 2             Average            Regression of round 2
                 residual           residual on round 1               residual           residual on round 1
Sort          ----------------   --------------------------        ----------------   --------------------------
Group   Obs.  Round 1  Round 2   Reg coef      F      p-value      Round 1  Round 2   Reg coef      F      p-value
  1     1471   -5.184   -0.097     -0.012    0.030     0.864        -5.183    0.060     -0.043    0.406     0.524
  2     1503   -3.859   -0.135     -0.140    1.248     0.264        -3.847    0.016     -0.054    0.183     0.669
  3     1538   -3.149   -0.027      0.037    0.055     0.815        -3.146   -0.003     -0.006    0.002     0.969
  4     1576   -2.612   -0.096     -0.187    1.152     0.283        -2.613   -0.029     -0.416    5.350     0.021
  5     1615   -2.151   -0.043     -0.257    1.593     0.207        -2.145    0.004     -0.124    0.383     0.536
  6     1690   -1.746   -0.111     -0.060    0.082     0.774        -1.742   -0.052     -0.089    0.169     0.681
  7     1730   -1.356   -0.016     -0.083    0.130     0.719        -1.351    0.023     -0.245    1.174     0.279
  8     1758   -0.991   -0.003      0.015    0.004     0.948        -0.985    0.035      0.275    1.356     0.244
  9     1795   -0.633   -0.092     -0.089    0.125     0.723        -0.630   -0.055     -0.278    1.256     0.263
 10     1820   -0.282   -0.014      0.078    0.102     0.750        -0.281   -0.154     -0.035    0.019     0.889
 11     1809    0.075   -0.021      0.114    0.228     0.633         0.072    0.072      0.131    0.302     0.583
 12     1775    0.443    0.034      0.355    2.417     0.120         0.439    0.088      0.344    2.251     0.134
 13     1738    0.821   -0.135     -0.001    0.000     0.996         0.819   -0.207      0.009    0.002     0.965
 14     1712    1.217    0.099      0.150    0.548     0.459         1.215    0.005     -0.098    0.242     0.623
 15     1645    1.610   -0.135     -0.060    0.105     0.746         1.608   -0.041     -0.242    1.734     0.188
 16     1593    2.047    0.072      0.014    0.006     0.939         2.035   -0.030      0.025    0.021     0.884
 17     1553    2.555    0.141     -0.114    0.480     0.489         2.547    0.053     -0.055    0.120     0.729
 18     1521    3.156    0.059      0.144    1.006     0.316         3.152    0.026     -0.017    0.015     0.902
 19     1487    3.964    0.219     -0.005    0.002     0.964         3.955    0.025     -0.032    0.082     0.774
 20     1457    5.699    0.328      0.121    5.051     0.025         5.681    0.207      0.058    1.188     0.276

NOTES: When the total number of scores in round 1 is not evenly divisible by 20, the extra scores are assigned first to group 10, then to groups 11, 9, 12, 8, 13, etc. Therefore, there are more scores recorded for the middle groups than for those at either end.
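The group-assignment rule in the note to Table 5 (extra scores go first to group 10, then to groups 11, 9, 12, 8, 13, ...) amounts to fanning outward from the middle sort group. A minimal sketch of that ordering for 20 groups; the function name is ours, and the handling beyond the order quoted in the note is an assumption:

```python
def extra_group_order(n_groups=20):
    """Order in which leftover round-1 scores are assigned to sort
    groups: start at the middle group, then alternate outward
    (10, 11, 9, 12, 8, ...). Groups are numbered 1..n_groups."""
    mid = n_groups // 2          # group 10 when there are 20 groups
    order = [mid]
    step = 1
    while len(order) < n_groups:
        if mid + step <= n_groups:          # one step above the middle
            order.append(mid + step)
        if len(order) < n_groups and mid - step >= 1:
            order.append(mid - step)        # one step below the middle
        step += 1
    return order

print(extra_group_order()[:6])  # [10, 11, 9, 12, 8, 13]
```

This ordering is why the middle groups in Table 5 carry slightly more observations than the groups at either end.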

We adjust each residual score in the second of two adjacent rounds by adding the first-round residual divided by the number of rounds recorded by the same player minus one. This adjustment spreads the built-in bias from conditioning on the first-round residual evenly over the player's remaining rounds. It reflects the fact that the residual scores of each individual player sum to zero: if the residual score in the first of two adjacent rounds is positive, the expected residual in each remaining round must be slightly negative, and if the first-round residual is negative, the expected second-round residual must be slightly positive.
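In symbols, the adjusted second-round residual is r2 + r1/(n - 1), where n is the number of rounds the player recorded. A minimal sketch of the adjustment; the function name is ours:

```python
def adjust_second_round_residual(r1, r2, n_rounds):
    """Remove the conditioning bias from a second-round residual.

    Because each player's residuals sum to zero, a first-round
    residual of r1 implies an expected residual of -r1/(n_rounds - 1)
    in every remaining round; adding r1/(n_rounds - 1) offsets it.
    """
    return r2 + r1 / (n_rounds - 1)

# A player with 100 recorded rounds and a +2 first-round residual:
# each remaining round is expected at -2/99, so a raw second-round
# residual of 0 adjusts to +2/99 (about 0.0202).
print(adjust_second_round_residual(2.0, 0.0, 100))
```

The adjustment is tiny for players with many recorded rounds and matters most for players with short playing histories.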

Table 6
Actual Scores and θ Residual Scores for Top 40 Finishers in the 2001 Players Championship

                                Actual Scores                              Residual Scores
Player                 Day 1  Day 2  Day 3  Day 4  Total    Day 1    Day 2    Day 3    Day 4     Total
TIGER WOODS              72     69     66     67    274     1.809   -2.230   -4.327   -3.195    -7.942
VIJAY SINGH              67     70     70     68    275    -4.489   -2.529   -1.627   -3.495   -12.140
BERNHARD LANGER          73     68     68     67    276     0.957   -5.066   -4.147   -5.000   -13.256
                         69     66     70     73    278    -3.893   -7.933   -3.031    0.101   -14.756
                         72     71     68     70    281    -0.252   -2.295   -4.395   -2.267    -9.208
                         68     72     70     71    281    -4.847   -1.885   -2.981   -1.848   -11.560
FRANK LICKLITER          72     72     70     68    282    -0.725   -1.765   -2.864   -4.733   -10.087
SCOTT HOCH               67     70     71     74    282    -5.204   -3.243   -1.339    1.793    -7.992
PAUL AZINGER             66     70     74     72    282    -5.983   -3.020    1.885    0.019    -7.099
                         73     73     67     70    283     0.460   -0.583   -5.683   -2.554    -8.360
NICK PRICE               70     74     71     68    283    -1.870    1.088   -1.010   -3.880    -5.673
DAVID TOMS               70     77     66     71    284    -2.006    3.955   -6.141   -1.008    -5.200
JOSE MARIA OLAZABAL      71     76     68     69    284    -1.808    2.147   -4.955   -3.828    -8.445
                         71     71     72     70    284    -1.115   -2.156   -0.255   -2.125    -5.650
                         73     71     71     70    285    -0.761   -3.830   -2.956   -3.853   -11.399
SCOTT DUNLAP             70     73     73     69    285    -3.186   -1.231   -0.333   -4.206    -8.955
                         77     67     69     73    286     4.107   -6.938   -4.040    0.087    -6.784
KENNY PERRY              71     66     74     75    286    -1.195   -7.234    1.670    2.803    -3.955
                         67     72     76     71    286    -5.969   -2.010    2.892   -1.978    -7.064
                         73     72     69     73    287    -0.339   -2.369   -4.456   -0.314    -7.478
TIM HERRON               73     74     71     69    287     0.012   -0.031   -2.132   -4.004    -6.155
                         68     75     71     73    287    -4.680    1.279   -1.819    0.312    -4.908
JIM FURYK                72     75     72     68    287    -0.064    1.892   -0.210   -4.082    -2.464
J.P. HAYES               72     69     76     70    287    -1.402   -5.443    2.458   -3.413    -7.800
                         68     78     69     73    288    -4.896    4.061   -4.040    0.088    -4.786
JOE OZAKI                77     68     72     71    288     3.116   -6.932   -2.037   -2.913    -8.766
DARREN CLARKE            75     70     72     71    288     2.706   -3.312   -0.388   -1.235    -2.229
BRAD FAXON               72     74     73     69    288    -0.721    0.239    0.142   -3.727    -4.067
                         74     70     73     72    289     0.620   -4.426   -0.530   -1.404    -5.740
                         72     71     76     70    289    -0.410   -2.451    3.452   -2.417    -1.826
MARK WIEBE               73     73     69     75    290    -1.144   -2.189   -5.291    0.835    -7.789
PHIL MICKELSON           73     68     72     77    290     1.754   -4.283    0.623    5.758     3.852
                         74     73     75     68    290     1.416   -0.625    2.278   -4.591    -1.523
JEFF SLUMAN              72     71     75     72    290    -0.483   -2.525    2.376   -0.495    -1.126
                         71     73     76     70    290    -2.058   -1.102    2.798   -3.074    -3.436
FRED FUNK                70     71     77     72    290    -2.718   -2.761    4.138   -0.733    -2.074
BRIAN GAY                73     74     72     72    291     0.178    0.146   -0.944   -0.804    -1.425
BOB TWAY                 72     73     74     72    291    -0.882   -0.919    0.987   -0.878    -1.693
COLIN MONTGOMERIE        71     71     75     74    291    -1.317   -2.358    2.543    1.673     0.541
STEVE FLESCH             70     73     76     72    291    -2.429   -0.470    3.432   -0.438     0.096

Table 7
Twenty Highest Total Residual Scores of Tournament Winners or of Players who Tied for First

Tournament              Winning Score   Winner or Tie for First   θ Residual
99 DISNEY                    271        TIGER WOODS                  -2.14
99 TOUR CHAMP                269        TIGER WOODS                  -2.73
00 ATT                       273        TIGER WOODS                  -3.23
01 BUICK INV                 269        PHIL MICKELSON               -3.29
01 NEC                       268        TIGER WOODS                  -3.36
01                           270        SERGIO GARCIA                -3.53
00 BELL SOUTH                205        PHIL MICKELSON               -3.58
99 AM EXPRESS                278        TIGER WOODS                  -3.79
98 HOUSTON                   276        DAVID DUVAL                  -3.96
98 HOUSTON                   276        DAVID DUVAL                  -3.96
01 BUICK INV                 269        DAVIS LOVE III               -4.09
01 BAY HILL                  273        TIGER WOODS                  -4.20
00                           266        TIGER WOODS                  -4.63
99 NEC                       270        TIGER WOODS                  -4.77
01 TOUR CHAMPIONSHIP         270        SERGIO GARCIA                -4.84
01 MASTERS                   272        TIGER WOODS                  -4.84
01 HARTFORD                  264        PHIL MICKELSON               -4.90
99 WESTERN                   273        TIGER WOODS                  -5.09
98 BELLSOUTH                 271        TIGER WOODS                  -5.17
00 WESTCHESTER               276        DAVID DUVAL                  -5.18

Table 8
Summary of Tests of Hot Hands Among Players with Significant Positive and Negative First-Order Autocorrelation in Residual Golf Scores

                                         Runs Test          Markov Chain    Logistic Regression
Player                    φ         Z-value    p-value       p-value       History coef  p-value

Positive first-order autocorrelation:
LEE WESTWOOD           0.35614      -2.66      0.004*         0.032*          0.0547      0.2973
MARK CALCAVECCHIA      0.18480      -1.95      0.025*         0.054*         -0.0110      0.2937
PHIL TATAURANGI        0.18450      -2.12      0.017*         0.025*          0.0372      0.0118*
KEVIN WENTWORTH        0.16009      -3.60      0.000*         0.001*          0.0227      0.1351
JERRY KELLY            0.15943      -2.26      0.012*         0.033*         -0.0065      0.5102
SERGIO GARCIA          0.15359      -0.83      0.204          0.436          -0.0211      0.3437
MARK WIEBE             0.15355      -2.79      0.003*         0.008*         -0.0095      0.3846
                       0.15261      -2.04      0.020*         0.139          -0.0038      0.7685
DICKY PRIDE            0.14799      -1.88      0.030*         0.038*         -0.0116      0.3083
                       0.14342      -0.34      0.366          0.787          -0.0047      0.6875
LEE PORTER             0.14325      -2.11      0.018*         0.050*         -0.0033      0.8130
                       0.14027      -2.04      0.021*         0.032*         -0.0015      0.9321
DOUG DUNAKEY           0.13843      -0.87      0.193          0.566          -0.0096      0.4292
MIKE HULBERT           0.13725      -1.41      0.079          0.136          -0.0059      0.6153
                       0.13719      -1.50      0.066          0.180           0.0170      0.2568
GABRIEL HJERTSTEDT     0.13171      -3.13      0.001*         0.004*         -0.0181      0.0653
TIGER WOODS            0.12824      -3.24      0.001*         0.001*         -0.0206      0.0733
RICK FEHR              0.12793      -1.88      0.030*         0.085          -0.0042      0.7236
BILL GLASSON           0.12556      -0.10      0.459          0.884          -0.0234      0.0620
                       0.11898      -1.45      0.074          0.315          -0.0204      0.1191
TIM LOUSTALOT          0.11886      -0.26      0.398          0.589           0.0141      0.5153
JOE DURANT             0.11685      -0.43      0.332          0.871           0.0010      0.9135
SCOTT SIMPSON          0.11649      -0.76      0.223          0.394           0.0324      0.1608
WOODY AUSTIN           0.11347      -1.66      0.048*         0.082          -0.0058      0.6021
                       0.10983      -1.40      0.081          0.143           0.0091      0.3713
FRANK LICKLITER        0.10228      -1.48      0.069          0.177          -0.0080      0.4403
J.P. HAYES             0.10188      -0.69      0.247          0.601           0.0072      0.5366
FRANKLIN LANGHAM       0.09917       0.25      0.598          0.680           0.0158      0.1787
JIM FURYK              0.09459      -1.71      0.043*         0.146           0.0214      0.0900
BILLY MAYFAIR          0.08988      -1.31      0.094          0.201          -0.0004      0.9616
BLAINE MCCALLISTER     0.08592      -0.82      0.207          0.439          -0.0071      0.5587
SCOTT HOCH             0.08304      -0.97      0.167          0.357          -0.0040      0.7453

Negative first-order autocorrelation:
SCOTT MCCARRON        -0.08293       1.11      0.866          0.242          -0.0021      0.8558
R.W. EAKS             -0.14154       2.63      0.996          0.009*         -0.0027      0.9101
COLIN MONTGOMERIE     -0.16190       0.63      0.736          0.458           0.0485      0.0436*
                      -0.20396       1.52      0.936          0.057*          0.0207      0.3466
CLARK DENNIS          -0.21934       2.19      0.986          0.031*          0.0084      0.7559
IAN LEGGATT           -0.27914       3.21      0.999          0.003*          0.0615      0.2086

* Indicates statistical significance at the 5 percent level.

Figure 1. Distribution of cumulative number of players and cumulative number of rounds recorded on the PGA Tour in 1998-2001 as a function of the total number of rounds recorded per player.

Figure 2. Spline-based mean scores adjusted for round-course interactions as a function of scaled golf time.

The upper panel shows 18-hole scores reduced by the effects of round-course interactions (connected by jagged lines) along with the corresponding spline fits (smooth lines). The lower panel repeats the same spline fits (smooth lines) along with predicted 18-hole scores adjusted for round-course interactions (connected by jagged lines), computed as a function of prior residual errors and the first-order autocorrelation coefficient estimated in connection with the spline fit.

Figure 3. Summary characteristics of cubic spline fits in model 3.

Figure 4. Standard deviation of residual 18-hole scores (θ) among 253 PGA Tour players.

Figure 5. Distribution of round-course interaction coefficients.

Figure 6. Distribution of greatest advantage in multiple-course tournaments.