

FORECASTER'S FORUM

Discussion of Verification Concepts in Forecast Verification: A Practitioner's Guide in Atmospheric Science

BOB GLAHN

Meteorological Development Laboratory, National Weather Service, Silver Spring, Maryland

(Manuscript received 16 December 2003, in final form 9 February 2004)

Corresponding author address: Dr. Harry R. Glahn, Meteorological Development Laboratory, National Weather Service, W/OST2, 1325 East-West Highway, Silver Spring, MD 20910. E-mail: [email protected]

The book Forecast Verification: A Practitioner's Guide in Atmospheric Science, edited by Jolliffe and Stephenson (2003, hereafter JS03), fills a void in verification of meteorological and climate forecasts. While a number of books on aspects of statistics related to weather and climate (e.g., Wilks 1995) discuss verification, this complete book is devoted to the subject. The book comprises a fairly tightly coupled set of chapters written by generally well-known experts, in some cases perhaps more so in Europe than North America, in verification and especially in the subjects of their particular chapters. In a book in which sections or chapters are written by different authors, one asks the following questions: 1) how well do the individual chapters read and present the material logically, accurately, and comprehensively; and 2) how well do the chapters relate to one another and address the full subject of the book? Regarding the first question, JS03 gets high marks for most chapters. On the second question, JS03 is better than many, although the editors have not suppressed individuality enough in some instances for it to read like a fully cohesive book. JS03 is a voluminously referenced and well-indexed survey of what is known about, and a historical account of, verification and the related topic of evaluation as it exists in the meteorological literature. The editors have put much emphasis on standardizing mathematical notation throughout, and were quite successful, an achievement in itself. While the methods presented can be applicable to most any forecasting problem, the discussion and examples are tied to weather and climate forecasting, as acknowledged by JS03 (preface), which hardly translates into the full scope of "atmospheric science."

I have been interested in and involved with verification even before my entry into the U.S. Weather Bureau in 1958. In the Alaskan Weather Center of the U.S. Air Force, as the first numerical weather prediction (NWP) "progs" were rolling out, we were using a score similar to the S1 score (JS03, p. 129; Teweles and Wobus 1954). Roger Allen (for whom I worked for several years) and Jack Thompson hired me, and my office was just down the hall from Glenn Brier and Thompson. All three had recently published what have turned out to be landmark papers (Brier 1950; Thompson 1952; Thompson and Brier 1955; Brier and Allen 1951); all except Thompson and Brier are referenced in JS03. Over the years, I have watched the verification literature grow to what it is today. I certainly agree with JS03 that "Allan Murphy had a major impact on the theory and practice of forecast verification" (p. 3). Murph was a prolific writer, maintaining over long periods a paper a month. He collaborated with many others of renown and touched on most subjects relating to forecast verification. The one topic with which he had not gotten entirely comfortable was forecasts of spatial fields (Allan Murphy, personal communication), although he and Ed Epstein defined a skill score for model verification (Murphy and Epstein 1989). Perhaps the single most important paper he coauthored was the landmark "A general framework for forecast verification" (Murphy and Winkler 1987), mentioned by JS03 in chapter 1.

A possible runner-up in importance to the Murphy and Winkler paper in the meteorological verification literature was the introduction of the relative operating characteristic (ROC) into meteorology. While Murph embraced this concept, it was first brought into the meteorological literature by Ian Mason (1980, 1982a,b), who reported and built upon the work of John Swets (1973). John and Ian were two of the invitees to a workshop on probabilistic weather forecasts in 1983 at Timberline Lodge on Mount Hood, Oregon, organized by Murphy. ROC has not played as major a role in the past as such scores as versions of the skill score and threat score (sometimes with different names), but it is beginning to come to the forefront with the recognition and use of probability information. Even the terminology base rate, hit rate, and false alarm rate has come into prominence in meteorological forecast verification largely through the influence of ROC. For instance, JS03 in the definition for hit rate states that it is "Also known as probability of detection in older literature." I would counter that most readers and developers associated with atmospheric science are more familiar with "probability of detection" than they are with "hit rate." Maybe that is because they, too, are "older."
I have identified a number of recurring themes or central ideas in JS03, mentioned below:

- Finley's tornado forecasts. The now-famous 2 × 2 contingency table of Finley's (1884) yes/no tornado forecasts is introduced on page 1 and is discussed several times. The table even appears, almost subliminally, on the cover. JS03 states ". . . there is a surprisingly large number of ways in which the numbers in the four cells . . . can be combined to give measures of the quality of the forecasts." (A small worked example follows this list.)

- Verification presented from a developer's viewpoint. Much of the discussion seems to have as an objective developing or improving a forecast system rather than judging the, possibly comparative, goodness of a set of forecasts. While both aspects are important, JS03 does not clearly make the distinction, and I would have expected the concentration to be heavily on the latter rather than the former.

- Strong emphasis on the ROC and its associated terminology hit rate, H, and false alarm rate, F. While other ways to evaluate forecasts [e.g., computation of scores, such as mean absolute error (MAE)] are treated throughout the book, the ROC gets a very strong play. Albeit an important concept, it has a major deficiency: it does not consider calibration, and poorly calibrated forecasts may be judged to be as good as well-calibrated forecasts. This is stated in JS03 in some contexts, but is not emphasized, and when it is mentioned, it is usually dismissed with the suggestion to recalibrate, in keeping with the development theme.

- Probability forecasts. In agreement with the recent American Meteorological Society (AMS) statement on probability forecasts (AMS 2002), JS03 recommends the use of probability forecasts and emphasizes their potential value to customers over nonprobabilistic forecasts.

- Ensembles. The examples are, besides Finley's forecasts, largely in connection with climate or ensembles. Climate forecasts can provide good data in many contexts, but ensembles are overplayed, almost to the nonrecognition that there are other ways to make probability forecasts.

- Contributions of Allan Murphy. While it is no surprise to those interested in verification that Murphy's work would get many citations, the number and diversity are so great that it is a dominant thread in JS03.

- Attribution to other authors. JS03 provides many citations to previous works, which can be very helpful to those delving into details of verification and evaluation of meteorological and climate forecasts.
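To make the first theme concrete, here is a minimal sketch in Python of several of the measures that can be formed from the four cells of a 2 × 2 contingency table. The counts are the totals commonly quoted for Finley's (1884) tornado forecasts, and the formulas are the standard ones; this is my illustration, not code or data from JS03.

```python
# Scores computable from a 2 x 2 contingency table, illustrated with the
# commonly cited counts from Finley's (1884) tornado forecasts.
a, b, c, d = 28, 72, 23, 2680     # hits, false alarms, misses, correct rejections
n = a + b + c + d

hit_rate         = a / (a + c)          # H, also called probability of detection
false_alarm_rate = b / (b + d)          # F, paired with H to plot a ROC point
threat_score     = a / (a + b + c)      # critical success index
percent_correct  = (a + d) / n
base_rate        = (a + c) / n          # climatological frequency of the event
peirce           = hit_rate - false_alarm_rate   # Hanssen and Kuipers discriminant

print(f"H={hit_rate:.3f}  F={false_alarm_rate:.3f}  TS={threat_score:.3f}")
print(f"percent correct={percent_correct:.3f}  base rate={base_rate:.3f}  PSS={peirce:.3f}")
```

With these counts, always forecasting "no tornado" would score about 0.982 on percent correct, higher than the roughly 0.966 the actual forecasts achieve, which is the classic illustration of why so many alternative combinations of the four cells have been proposed.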
In chapter 1, the editors reiterate Brier and Allen's (1951) reasons for verification; use their terms "economic," "administrative," and "scientific," and note that a common theme is that any verification scheme needs to be informative (p. 4). They note that it is highly desirable that the verification system be objective; they examine various scores according to the attributes reliability, resolution, discrimination, and sharpness, as suggested by Murphy and Winkler (1987), and for "goodness," of which Murphy (1993) identified three types: consistency, quality (e.g., accuracy or skill), and value (utility); and they note that in order to quantify the value of a forecast, a baseline is needed, and that persistence, climatology,¹ and randomization are common baselines. I might add that the well-established objective method of Model Output Statistics (MOS) produces an important and more competitive baseline for many forecasts, especially in the National Weather Service in the United States; but this is not mentioned in JS03.

¹ Some would say that the term climatology is not appropriate here according to its strict definition, and that climatic forecasts, climatological forecasts, or something similar, would be more appropriate [see the AMS Glossary (Glickman 2000)].

While the idea of a baseline is important and seemingly a simple concept, even climatic forecasts as a baseline need more definition, because different "definitions" can give quite different results. For instance, in verifying temperature forecasts over a season, the mean temperature (climatic mean) over the season would be a poor baseline. One should rather use monthly means or some simple low-frequency curve fit to the data over the same seasonal extent. Even so, the question of using the sample frequencies of categories versus longer-term relative frequencies usually is not a given. For instance, Bob Livezey (in JS03, p. 78) states ". . . the exclusive use of sample probabilities (observed frequencies) of categories of the forecast/observation set being verified is recommended, rather than the use of historical data. The only exception to this is for the case where statistics are stationary and very well estimated." However, as with many verifications, the purpose comes into play. If one is comparing a set of subjective temperature forecasts with the baseline available to the forecaster when the forecasts are being made, the baseline is the historical record, not the mean of the time series yet to be observed, regardless of the stationarity of the time series. (Extreme nonstationarity would indicate that climatic forecasts were inappropriate as a baseline, but this is usually not known when the forecasts are being made, so climatic forecasts are the available baseline that is used.) In any case, the usual "skill" scores computed on multidimensional tables do generally base skill on the sample.
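To illustrate how much the choice of climatic baseline can matter, here is a small synthetic sketch (my own, not from JS03): the same forecasts score very differently on a mean-square-error skill score when the baseline is the seasonal mean rather than monthly means. The data and numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 90-day "season" of temperatures with a strong within-season trend,
# plus forecasts that follow the trend with somewhat larger random error.
days = np.arange(90)
obs  = 10.0 + 0.15 * days + rng.normal(0.0, 2.0, days.size)
fcst = 10.0 + 0.15 * days + rng.normal(0.0, 2.5, days.size)

def msess(forecast, observed, baseline):
    """Mean-square-error skill score relative to a baseline forecast."""
    mse_f = np.mean((forecast - observed) ** 2)
    mse_b = np.mean((baseline - observed) ** 2)
    return 1.0 - mse_f / mse_b

seasonal_mean = np.full_like(obs, obs.mean())   # a single climatic mean for the season
monthly_means = np.repeat([obs[:30].mean(), obs[30:60].mean(), obs[60:].mean()], 30)

print("skill vs. seasonal mean :", round(msess(fcst, obs, seasonal_mean), 3))
print("skill vs. monthly means :", round(msess(fcst, obs, monthly_means), 3))
```

Against the seasonal mean the forecasts look quite skillful, largely because they capture the seasonal trend; against monthly means the same forecasts can show little or even negative skill.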
It is also worth noting that the reduction of variance used in regression and predictor selection is relative to the overall mean of the sample. Therefore, a very high "score" can be obtained by getting only the seasonal variance right. I note that climatological forecasts, as used by Murphy and Epstein (1989), refer to the long term and not the sample, a departure from Bob Livezey's recommendation concerning categorical forecasts and from what is implied by Deque in JS03's chapter 5.

While development of objective forecasting systems is not the subject of the book and gets little direct treatment (other than ensembles), the authors do note that artificial skill is a danger in developing a forecasting system and emphasize cross validation and separate training and test datasets. In this regard, I note the changing terminology from what was used when objective techniques were assuming prominence (e.g., Thompson 1950; Allen and Vernon 1951). "Dependent" or "development" data were used to develop systems, rather than "training" data, and the data used to judge whether the system would hold up on new data were called "test" or "independent" data. I perceive that the term training has been highly influenced, unfortunately in my view, by the relatively modest development of systems requiring iterative solutions (e.g., neural networks; Marzban and Stumpf 1996), brought into the U.S. meteorological literature under the name of adaptive logic by Hu and Root (1964), rather than more analytic solutions (e.g., regression). The terminology and the digression from verification per se are likely occasioned by some of the authors' backgrounds as developers of objective systems.
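The warning about artificial skill is easy to demonstrate. The sketch below (my own illustration, with synthetic data) fits a regression with many candidate predictors and compares the reduction of variance on the dependent (development) sample with that on independent (test) data; the inflated number on the dependent sample is exactly the artificial skill the authors caution against.

```python
import numpy as np

rng = np.random.default_rng(7)

# Many candidate predictors, only one weak real signal: a setting in which
# fitting on the dependent sample alone produces "artificial skill".
n_dev, n_test, n_pred = 60, 60, 40
X_dev  = rng.normal(size=(n_dev,  n_pred))
X_test = rng.normal(size=(n_test, n_pred))
y_dev  = 0.5 * X_dev[:, 0]  + rng.normal(size=n_dev)
y_test = 0.5 * X_test[:, 0] + rng.normal(size=n_test)

def reduction_of_variance(y, yhat):
    return 1.0 - np.mean((y - yhat) ** 2) / np.var(y)

# Least-squares fit using every candidate predictor (overfits the development data).
coef, *_ = np.linalg.lstsq(X_dev, y_dev - y_dev.mean(), rcond=None)
rv_dependent   = reduction_of_variance(y_dev,  y_dev.mean() + X_dev  @ coef)
rv_independent = reduction_of_variance(y_test, y_dev.mean() + X_test @ coef)

print(f"RV on dependent (development) data : {rv_dependent:.2f}")   # looks impressive
print(f"RV on independent (test) data      : {rv_independent:.2f}") # the honest estimate
```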
It is curious that Potts (in JS03, p. 13) calls the variable for which the forecasts are formulated the "predictand" and seems to justify that terminology by implying that all forecasts are made by "forecasting systems." Strictly speaking, that may be correct, but the preponderance of forecasts, other than those made by NWP, are made subjectively by forecasters, and the variable that they are forecasting is generally not thought of as a predictand. This term comes from statistical objective systems, dating back to 1949 or before.² Even in "objective" NWP systems, the variables being forecast are not, to my knowledge, thought of as predictands. In the AMS Glossary of Meteorology (Glickman 2000), predictand is defined only in terms of regression (p. 594).

² Lorenz (1956), in his landmark paper on empirical orthogonal functions, uses the terms predictand and predictor. These terms were not in vogue in the 1940s when largely graphical methods of objective forecasting systems were developed and documented in unpublished U.S. Weather Bureau Research Papers (available from the National Oceanic and Atmospheric Administration's library), and published in the Monthly Weather Review. Even Bob White, one of the first to apply regression to forecasting, did not use the term in early papers (e.g., White and Galligan 1956). However, Gringorten (1949), as early as 1949 in a statistical forecasting study using the sorting of punched cards, carefully defines and uses both terms.

Potts states that a predictand (again, her use of the term) can be either deterministic or probabilistic. This is now a common use of the term "deterministic," but it carries more baggage than it is worth in my estimation. The term just means "nonprobabilistic" and not that there is necessarily any fundamental law that would "determine" a specific and correct value. But it has wide acceptance, probably for want of another more appropriate term, and will likely not fade (that is my probabilistic forecast). I like the term "definite," used by Drosdowsky and Zhang (in JS03, p. 121), or "definitive" better than deterministic.

Potts states (p. 14), "A deterministic forecast is really just a special case of a probabilistic forecast in which a probability of unity is assigned to one of the categories and zero to the others." However, the editors state in chapter 9 (p. 192), ". . . deterministic point forecasts are not perfectly sharp forecasts with probabilities equal to 1 and 0, but instead they should be assigned unknown sharpness." Even though these statements are brought together (only) in the glossary, it is not clear what view the book espouses.

Potts (p. 22) states that the sample variance is an unbiased estimate of the population variance, and defines the sample variance as

$$S_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad (1)$$

with n − 1 in the denominator. This has also become commonplace in both meteorology (e.g., Wilks 1995, p. 25) and statistics (e.g., Neter and Wasserman 1974, p. 10) texts, but I prefer the definition of sample variance having n in the denominator,

$$S_x^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad (2)$$

as was done earlier (Panofsky and Brier 1958, p. 26; Kendall and Stuart 1961, p. 4; Mood 1950, p. 132; Mode 1951, p. 64; Brooks and Carruthers 1953, p. 40; Klugh 1974, p. 53; Johnson and Jackson 1959, p. 32; and Underwood et al. 1954, p. 67), with (1) being the unbiased estimate of the population parameter. The change in definition was already taking place in the 1950s (Panofsky and Brier, op. cit., footnote p. 26); the reason is unclear to me. Why would the "mean" of the squares about the sample mean be found by dividing by n − 1? There are, as Kendall and Stuart (1961, op. cit., p. 4) state, ". . . reasons for preferring (1) to (2) as an estimator of the parent variance, notwithstanding the fact that the latter is (italics mine) the sample variance." What if the sample consisted of the entire population; why would (1) be used? One must keep straight whether a sample statistic is being calculated or a population parameter is being estimated. I believe definition (1) leads to an inconsistency in JS03's Eqs. (2.2) and (5.14), both of which purport to be the definition of skewness. If one is a sample statistic and one an estimate for the population, that is not clearly stated. Also, Eq. (5.14) seems not to agree with some other texts [e.g., Wilks 1995, Eq. (3.8)].
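The two conventions are easy to compare side by side. The following small check is my own illustration, using NumPy's ddof argument to compute both definitions discussed above.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

var_sample   = np.var(x, ddof=0)   # Eq. (2): divide by n, the sample variance proper
var_unbiased = np.var(x, ddof=1)   # Eq. (1): divide by n - 1, unbiased for a parent population

assert np.isclose(var_sample,   np.sum((x - x.mean()) ** 2) / n)
assert np.isclose(var_unbiased, np.sum((x - x.mean()) ** 2) / (n - 1))
print(var_sample, var_unbiased)    # 4.0 versus about 4.571
```

If the n values are the entire population of interest, ddof=0 is simply the variance; ddof=1 is appropriate only when estimating the variance of a larger parent population, which is the distinction the text above insists on keeping straight.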
Ian Mason goes into great detail in chapter 3 dealing with the two-category event, and continues a theme of the book in discussing this situation in terms of Finley's (1884) tornado forecasts. Up until Ian introduced the concept of the ROC into the meteorological literature, the verification of binary events centered around scores (e.g., Heidke skill, Hanssen and Kuipers, critical success index) computed on the 2 × 2 contingency table. If computing these scores can be considered a "method," then one can agree with Ian, "It is probably fair to say that there was very little change in verification practice for deterministic binary forecasts until the 1980's, with the introduction of methods from signal detection theory (SDT) . . . and the development of a general framework for forecast verification by Murphy and Winkler (1987)."

As mentioned earlier, in some respects the ROC methodology is, at present, a close runner-up to the Murphy and Winkler (1987) paper in influence, and that influence will likely be felt more with time, at least until its major deficiency of not considering reliability is fully appreciated. In this book, the terminology brought from the ROC methodology, "base rate," "hit rate," and "false alarm rate," is predominantly used, and most scores are in, or put into, those terms. If these terms prevail, it will be because of the ROC influence. We can be thankful the terms prefigurance and postagreement (Panofsky and Brier 1958) are not mentioned.

I find it curious that there are several places (p. 50, 53, 55, 70, 73) where negative skill (a set of binary forecasts that do worse than the baseline) seems to be no problem to the authors: just reverse the labels, and the negative skill becomes positive. Well, so! But usually we do not have the luxury of changing the forecasts we are verifying. Evidently, this statement is from the perspective, as mentioned earlier, of developing an objective system and the influence of ROC; but the book is about verification, which involves determining the correspondence of the forecasts and the "observations," not about switching labels at the end. Belaboring this point is not useful to a "practitioner" of verification.

ROC is given good treatment, covering its relationship to type-1 and -2 errors in hypothesis testing. The parametric (or modeled) area under the ROC curve is described along with the associated discrimination distance. One should keep in mind that this model seems to work in many instances without strong analytic justification. Not explicitly noted is the fact that the ROC measures or graphs only the "discrimination" ability of the set of forecasts, and does not rely on the forecasts being well calibrated (i.e., reliable). The ROC methodology was developed to discriminate signal from noise, and a "signal" or "forecast" not being calibrated may not be a terrible disadvantage in some applications, and even, perhaps, in developing an objective system. However, I believe this measure of discrimination (only) is currently being overemphasized when one is concerned with a full measure of the correspondence between forecasts and observations. While past forecasts can be calibrated, and this calibration may hold on similar future forecasts, "mislabeled" forecasts provided to a user, as discussed in JS03's chapter 7, would likely not be useful to her.

JS03 states, "The resolution component of the Brier score and the (area under the) ROC curve therefore often provide very similar information." They state, "A potential advantage of skill measures such as the ROC area is that they are directly related to a decision-theoretic approach and so can be easily related to the economic value of probability forecasts for forecast users." It is not clear to me how reliability (calibration), which is generally ignored by ROC, can not be crucial in determining the actual (rather than potential, if properly calibrated) economic value of forecasts. And is the area under the curve a "skill measure" or an "accuracy" measure? Or neither? JS03, in chapter 8, calls the area under the curve a "summary skill measure," but then goes on to formulate a skill score based on the ROC area. Ian Mason makes an interesting reference to Finley's (1884) tornado forecasts, and concludes, based on the ROC methodology, ". . . at 95% level . . . Finley's forecasts did have some skill!"

Richardson also addresses the ROC in chapter 8 and, in contrast to Mason in chapter 3, where only the modeled area under the curve A_z is discussed, goes to some length to discuss the actual area A under the curve when points are plotted on the hit rate–false alarm rate axes. He also defines a ROC skill score as ROCSS = 2A − 1, which ranges from 0 for no skill to 1 for perfect forecasts. Nothing is said about the possibility that ROCSS could be negative, which may go along with Mason (p. 70), "ROC points below the diagonal represent the same level of skillful performance as they would if reflected about the diagonal. If a forecasting system produces ROC points in this area, the forecasts are mislabeled." This is also hinted at by Richardson (p. 175), "If the forecasts are not reliable, then the threshold should be adjusted . . . the calibration procedure . . . makes this adjustment . . ." Again, quite so for developing and adjusting a forecast system for future use, but not for verifying or evaluating a set of existing forecasts.
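As a concrete companion to this discussion, here is a minimal sketch (mine, with made-up forecasts) that computes the empirical ROC points for a set of probability forecasts of a binary event, the actual (trapezoidal) area A under the plotted points, and ROCSS = 2A − 1 as defined by Richardson.

```python
import numpy as np

def roc_points(prob, obs):
    """Hit rate H and false alarm rate F with a threshold at each distinct forecast probability."""
    thresholds = np.unique(prob)[::-1]          # descending, so points run from (0,0) toward (1,1)
    H, F = [0.0], [0.0]
    for t in thresholds:
        yes = prob >= t
        H.append(np.sum(yes & (obs == 1)) / np.sum(obs == 1))
        F.append(np.sum(yes & (obs == 0)) / np.sum(obs == 0))
    H.append(1.0)
    F.append(1.0)
    return np.array(F), np.array(H)

# Illustrative forecasts and observations (synthetic, for demonstration only).
prob = np.array([0.9, 0.8, 0.7, 0.7, 0.5, 0.4, 0.3, 0.2, 0.1, 0.1])
obs  = np.array([1,   1,   0,   1,   1,   0,   0,   0,   1,   0  ])

F, H = roc_points(prob, obs)
A = np.sum(np.diff(F) * (H[1:] + H[:-1]) / 2.0)   # actual area under the plotted points
rocss = 2.0 * A - 1.0
print(f"A = {A:.3f}, ROCSS = {rocss:.3f}")
```

Note that the sketch says nothing about calibration: multiplying every forecast probability by a constant would leave the ROC points, the area, and ROCSS unchanged, which is the deficiency discussed above.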
The argument is sometimes made that if biased probability forecasts are given to a user, he/she will find that out and "recalibrate" or change his/her threshold. I would counter that, unless the "system" making the probability forecasts is objective and its characteristics can be expected to hold in the future, the user is playing a dangerous game. A provider of such forecasts ought to also notice the bias and correct it, making the adjustment made by the user invalid. It is true, a user may set a threshold such that, for instance, more "forecasts" of an event are made than actually occur (a bias of categorical forecasts) because of his/her utility matrix, but there is no excuse for providing a user with biased probability forecasts.

In whatever chapter the ROC is described, both the actual (from plotted points) and the modeled area should be discussed (in that order), not sequestered for the reader to attempt to coalesce. Some authors prefer the "modeled" diagram and proclaim that connecting consecutive points by straight lines underestimates the area under the curve. Quite so, but if the system being verified produced, or is capable of producing, only the specific probabilities associated with the plotted points in the diagram, it is not really appropriate to connect them at all except as an eye assist; if the points are very close together, it matters little how they are connected, and such connection is reasonable.

Bob Livezey discusses multicategory events and reviews various scores, but soon mentions that most scores are deficient when compared to the Gandin and Murphy (1992) "equitable" family of scores. Although the Heidke and Peirce skill scores are both equitable, they have the undesirable properties of depending on the forecast distribution and of not utilizing off-diagonal elements in the contingency table. Bob Livezey also discusses the relatively new LEPSCAT score and the sampling variability of the contingency table and skill scores, but a major thrust of the chapter leads to the family of Gandin and Murphy scores and how they can be constructed, and the Gerrity (1992) score (GS), one of the family, is recommended as the preferred one.

Gandin and Murphy scores are based on a reward (or penalty) value for each cell in the contingency table. This is the same concept that is used for determining the "value" of a set of forecasts, in contrast to skill, where the rewards and penalties are applied to a particular operation and are known or can be estimated [see Miller and Starr (1960, 82–85) for an early example of using weather forecasts in decision making under risk]. The trick here, as a general skill problem, is how to determine the reward (or penalty) matrix. Conditions for equitability can be defined, but they by themselves are insufficient to determine the matrix, so certain further conditions (or restraints) are made. Gerrity set constraints that, as it turned out, give his scores a remarkable property: a Gerrity score computed on the full k-cell table is the same as the arithmetic mean of all the k − 1 two-category Gerrity scores formed by combining categories on either side of the partitions between consecutive categories. Livezey notes this remarkable property and states, "Because of its convenience and built-in consistency, the family of GS is recommended here as equitable scores for forecasts of ordinal categorical events." This appears to be a good choice, but one must still remember that there is necessarily a certain arbitrariness to the values in the cells of the defining table.
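For reference, the two multicategory scores Livezey compares against the Gandin and Murphy family can be computed directly from a k × k contingency table. The sketch below is my own, with made-up counts, and uses the standard multicategory Heidke and Peirce formulas; it does not attempt to reproduce the Gerrity scoring-matrix construction itself.

```python
import numpy as np

def heidke_peirce(table):
    """Heidke and Peirce skill scores from a k x k contingency table
    (rows = forecast category, columns = observed category)."""
    p  = table / table.sum()      # joint relative frequencies
    pc = np.trace(p)              # proportion correct
    pf = p.sum(axis=1)            # marginal forecast frequencies
    po = p.sum(axis=0)            # marginal observed frequencies
    e  = np.sum(pf * po)          # proportion correct expected by chance
    hss = (pc - e) / (1.0 - e)
    pss = (pc - e) / (1.0 - np.sum(po ** 2))
    return hss, pss

# Illustrative 3-category table (say, below/near/above normal); counts are made up.
table = np.array([[50, 20, 10],
                  [15, 40, 15],
                  [ 5, 25, 45]], dtype=float)
hss, pss = heidke_peirce(table)
print(f"Heidke = {hss:.3f}, Peirce = {pss:.3f}")
```

Both scores use only the diagonal plus the marginal distributions, so how far off the diagonal an error falls never enters, which is part of the criticism noted above and part of why the chapter steers readers toward the Gandin and Murphy family.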
Deque in chapter 5 uses the term "variable" instead of the statistical developers' term "predictand" preferred by Potts (p. 13). Although "quantity" is also used for variables such as temperature and pressure, I prefer to reserve quantity for something quantitative, not a substitute for a random variable; however, this usage of quantity has now become common in the meteorological literature.

JS03 gives some, but minimal, treatment to the related topics of sampling error, artificial skill, and significance testing. These are very important topics and deserve more consideration. "Prediction interval" is contrasted with "confidence interval" (p. 105), but no definitive explanation of the difference is given; verification as a regression problem is mentioned in chapter 2, and the discussion of the Pearson product moment correlation coefficient there (p. 106) provides an excellent opportunity to demonstrate the difference.

One can agree with Deque's statement, ". . . it is desirable that the overall distribution of forecasts is similar to that of the observations, irrespective of their case-to-case relationship with the observations" (p. 113). However, the statement, "Before the forecasts are delivered to unsuspecting users, it is important to rescale (inflate) them" (p. 114), can be questioned. A definition of "inflate" is not given, but it has come to mean in many instances that defined for regression estimates by Klein et al. (1959) (no attribution in JS03), and it may or may not be desirable. The mean square error skill score (p. 104) for inflated unbiased forecasts will be negative if the (Pearson product moment) correlation coefficient between noninflated forecasts and observations is < 0.5 (Glahn and Allen 1966). That is, in developing the regression equation, if the reduction of variance is < 0.25, inflated forecasts will have a larger mean square error than the sample mean. An unsuspecting user, having been given inflated forecasts, might expect them to be skillful!
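That result is easy to reproduce. The sketch below is my own, with synthetic data: it builds damped regression forecasts with a correlation of about 0.4, "inflates" them so their variance matches the observed variance, and computes the mean square error skill score against the sample mean. For variance-matched unbiased forecasts the score works out to about 2r − 1, hence negative whenever the correlation r is below 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 5000, 0.4                                       # r: predictor-observation correlation

x    = rng.normal(0.0, 1.0, n)                         # predictor
obs  = r * x + rng.normal(0.0, np.sqrt(1 - r**2), n)   # observations, unit variance
fcst = r * x                                           # regression forecast: damped, unbiased
inflated = fcst * obs.std() / fcst.std()               # rescaled to match the observed variance

def msess(f, o):
    """MSE skill score with the sample mean of the observations as baseline."""
    return 1.0 - np.mean((f - o) ** 2) / np.mean((o - o.mean()) ** 2)

print(f"correlation       : {np.corrcoef(fcst, obs)[0, 1]:.2f}")   # about 0.4
print(f"MSESS, regression : {msess(fcst, obs):.2f}")                # about r**2 = 0.16
print(f"MSESS, inflated   : {msess(inflated, obs):.2f}")            # about 2r - 1 = -0.20
```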
Spatial rainfall forecasts are sin- straints that, as it turned out, gives his scores a re- gled out as being especially challenging to verify. markable property. A Gerrity score computed on the Introducing a ``spacial'' dimension adds a truly new full k-cell table is the same as the arithmetic mean of dimension to the complexity of veri®cation. The basic all the k Ϫ 1 two-category Gerrity scores formed by question of ``What is a good forecast?'' becomes more combining categories on either side of the partitions dif®cult and may depend more heavily on the forecast's between consecutive categories. Livezey notes this re- purpose. For instance, if the ``pattern'' is right, but dis- markable property and states, ``Because of its conve- placed, is that a good forecast? While discussed in ref- nience and built-in consistency, the family of GS is erence to rainfall forecasts, the question is pertinent for recommended here as equitable scores for forecasts of most all ®elds of ``weather'' variables. Novel approach- ordinal categorical events.'' This appears to be a good es relying on translating patterns to get a good match choice, but one must still remember that there is nec- are only brie¯y introduced. The last paragraph mentions 774 WEATHER AND FORECASTING VOLUME 19 the possibility that the ®eld can be veri®ed at different that a forecast system does not) really have to be reli- spatial scales. Although pertinent references are given, able, one just has to calibrate so that they are. This is one is left with a certain incompleteness regarding spa- emphasized later (p. 163) and contributes to the per- tial ®elds. ception that the book is oriented for a developer of My biggest disappointment with JS03 is their rolling systems, not for one who is going to verify or evaluate the veri®cation of probability forecasts into a chapter an actual, unchangeable set of forecasts. Both purposes shared by ensemble forecasting. JS03 states (p. 155), of veri®cation are important, but JS03 never clearly ``Ensemble forecasting is now one of the most com- makes the distinction. monly used methods for generating probability forecasts Chapter 8, written by Richardson, discusses the third that can take account of uncertainty in initial and ®nal type of goodness previously identi®ed by Murphy, value conditions.'' While ensemble forecasting is in its as- or utility. Other measures associated with the corre- cendancy, and the statement is true in terms of the un- spondence of forecasts and observations (e.g., skill and certainty of the initial conditions estimated by data as- accuracy) are not directly measures of the usefulness of similation, precious little overall work has been done forecasts to a user, although they are certainly related. operationally with ensembles in a postprocessing prob- This topic was brought into modern U.S. meteorological abilistic sense, except for the occurrence of precipita- literature by Jack Thompson (1950, 1952; Thompson tion, which, being binary, lends itself well to direct rel- and Brier 1955). Thompson formulated the now-familiar ative frequency treatment. It is not the most commonly cost±loss decision model,3 and the same model has been used method of producing probability forecasts; rather, described and discussed many times since, as Richard- statistical postprocessing of single-model runs have pro- son indicates, and is summarized in this chapter. 
In keep- duced a plethora of probabilistic guidance forecasts for ing with a theme in JS03, hit rate and false alarm rate many weather elements for many years, including prob- are brought into play, and the usefulness of the Peirce ability of , type of precipitation (freezing, skill score and the Clayton skill score in this context frozen, or liquid, and showers, , or ), wind, are discussed. Through analysis, Richardson concludes, amount, ceiling height, and visibility. No refer- in concert with previous authors, ``. . . no single thresh- ence is given to U.S. or Canadian work in the short old probability (e.g., a deterministic forecast) will be range (0±10 days), both of which have been primary optimal for a range of users with different cost/loss ra- centers of postprocessing activity for many years. All tiosÐthis is a strong motivation for providing proba- of this information, along with, more recently, ensemble bility rather than deterministic forecasts.'' information, is provided to forecasters as guidance in If there is a single user with an identi®ed cost of making the ``of®cial'' forecasts. In addition, probability protection for ``adverse'' weather and loss if protection of precipitation forecasts have been produced as of®cial is not taken, then the utility for a set of forecasts for forecasts by the U.S. National Weather Service since that user can be calculated. Also, if the distribution of 1966 (Hughes 1980). It is a disservice to the unsus- users is known, the overall utility can be calculated and pecting reader to suggest that we really need only be related to the Brier score, which Richardson shows. For concerned with probability forecasts produced directly instance, if the distribution of cost±loss ratios for users from ensembles. Unfortunately, the ``equating'' of prob- is uniform over the 0±1 range, then (per unit loss) ``the ability forecasts and ensembles is carried over into chap- Brier skill score is the overall value'' (p. 183). ter 8, where in the statement, ``The contrasting effects I found JS03 to be very interesting reading and will of ensemble size on the ROC area and Brier skill scores be useful, provided some of its concepts and statements are examined in Section 8.5 . . . ,'' there is no recog- are carefully evaluated for the speci®c use to be made nition that ensembles are not the only game in town. of them. Perhaps this paper and likely follow-on dia- JS03's editors should have forced a more balanced view. logue will help improve the second edition. I even ask, why is a chapter on ensemble forecasting, which is a method of making forecasts, not verifying REFERENCES them, put into a book on veri®cation? Why not include a chapter on regression, or discriminant analysis, both Allen, R. A., and E. M. Vernon, 1951: Objective weather forecasting. methods of producing probabilistic forecasts? Compendium of Meteorology, T. F. Malone, Ed., Amer. Meteor. Soc., 796±813. JS03 states ``. . . the two most important attributes of AMS, 2002: Enhancing weather information with probability fore- probability forecasts (are) referred to as reliability and casts. Bull. Amer. Meteor. Soc., 83, 450±452. resolution'' (p. 138), not news to the reader at this point. Brier, G. W., 1950: Veri®cation of forecasts expressed in terms of Later, they say that resolution is the most important probability. Mon. Wea. Rev., 78, 1±3. ÐÐ, and R. A. Allen, 1951: Veri®cation of weather forecasts. Com- attribute of a forecast system (p. 142). While these two pendium of Meteorology, T. F. 
I found JS03 to be very interesting reading, and it will be useful, provided some of its concepts and statements are carefully evaluated for the specific use to be made of them. Perhaps this paper and the likely follow-on dialogue will help improve the second edition.

REFERENCES

Allen, R. A., and E. M. Vernon, 1951: Objective weather forecasting. Compendium of Meteorology, T. F. Malone, Ed., Amer. Meteor. Soc., 796-813.
AMS, 2002: Enhancing weather information with probability forecasts. Bull. Amer. Meteor. Soc., 83, 450-452.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1-3.
Brier, G. W., and R. A. Allen, 1951: Verification of weather forecasts. Compendium of Meteorology, T. F. Malone, Ed., Amer. Meteor. Soc., 841-855.
Brooks, C. E. P., and N. Carruthers, 1953: Handbook of Statistical Methods in Meteorology. Her Majesty's Stationery Office, 412 pp.
Finley, J. P., 1884: Tornado predictions. Amer. Meteor. J., 1, 85-88.
Gandin, L. S., and A. H. Murphy, 1992: Equitable scores for categorical forecasts. Mon. Wea. Rev., 120, 361-370.
Gerrity, J. P., Jr., 1992: A note on Gandin and Murphy's equitable skill score. Mon. Wea. Rev., 120, 2707-2712.
Glahn, H. R., and R. A. Allen, 1966: A note concerning the "inflation" of regression forecasts. J. Appl. Meteor., 5, 124-126.
Glickman, T., Ed., 2000: Glossary of Meteorology. 2d ed. Amer. Meteor. Soc., 755 pp.
Gringorten, I. I., 1949: A study in objective forecasting. Bull. Amer. Meteor. Soc., 30, 10-15.
Hu, M. J. C., and H. E. Root, 1964: An adaptive data processing system for weather forecasting. J. Appl. Meteor., 3, 513-523.
Hughes, L. A., 1980: Probability forecasting - Reasons, procedures, problems. NOAA Tech. Memo. NWS FCST 24, 84 pp.
Johnson, P. O., and R. W. B. Jackson, 1959: Modern Statistical Methods: Descriptive and Inductive. Rand McNally, 514 pp.
Jolliffe, I. T., and D. B. Stephenson, Eds., 2003: Forecast Verification: A Practitioner's Guide in Atmospheric Science. John Wiley and Sons, 254 pp.
Kendall, M. G., and A. Stuart, 1961: Inference and Relationship. Vol. 2, The Advanced Theory of Statistics, Hafner, 676 pp.
Klein, W. H., B. M. Lewis, and I. Enger, 1959: Objective prediction of five-day mean temperature during winter. J. Meteor., 16, 672-682.
Klugh, H. E., 1974: Statistics: The Essentials for Research. John Wiley and Sons, 426 pp.
Liljas, E., and A. H. Murphy, 1994: Anders Angstrom and his early papers on probability forecasting and the use/value of weather forecasts. Bull. Amer. Meteor. Soc., 75, 1227-1236.
Lorenz, E. N., 1956: Empirical orthogonal functions and statistical weather prediction. Statistical Forecasting Project, Massachusetts Institute of Technology Scientific Rep. 1, 49 pp.
Marzban, C., and G. J. Stumpf, 1996: A neural network for tornado prediction based on Doppler radar-derived attributes. J. Appl. Meteor., 35, 617-626.
Mason, I., 1980: Decision-theoretic evaluation of probabilistic predictions (using the relative operating characteristic). Proc. WMO Symp. on Probabilistic and Statistical Methods in Weather Forecasting, Nice, France, WMO, 219-228.
Mason, I., 1982a: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291-303.
Mason, I., 1982b: On scores for yes/no forecasts. Preprints, Ninth Conf. on Weather Forecasting and Analysis, Seattle, WA, Amer. Meteor. Soc., 169-173.
Miller, D. W., and M. K. Starr, 1960: Executive Decisions and Operations Research. Prentice Hall.
Mode, E. B., 1951: Elements of Statistics. Prentice-Hall, 377 pp.
Mood, A. M., 1950: Introduction to the Theory of Statistics. McGraw-Hill, 433 pp.
Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281-293.
Murphy, A. H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330-1338.
Murphy, A. H., and E. P. Epstein, 1989: Skill scores and correlation coefficients in model verification. Mon. Wea. Rev., 117, 572-581.
Neter, J., and W. Wasserman, 1974: Applied Linear Statistical Models. Richard D. Irwin, 842 pp.
Panofsky, H. A., and G. W. Brier, 1958: Some Applications of Statistics to Meteorology. The Pennsylvania State University, 224 pp.
Swets, J. A., 1973: The relative operating characteristic in psychology. Science, 182, 990-1000.
Teweles, S., and W. G. Wobus, 1954: Verification of prognostic charts. Bull. Amer. Meteor. Soc., 35, 455-463.
Thompson, J. C., 1950: A numerical method for forecasting rainfall in the Los Angeles area. Mon. Wea. Rev., 78, 113-124.
Thompson, J. C., 1952: On the operational deficiencies in categorical weather forecasts. Bull. Amer. Meteor. Soc., 33, 223-226.
Thompson, J. C., and G. W. Brier, 1955: The economic utility of weather forecasts. Mon. Wea. Rev., 83, 249-254.
Underwood, B. J., C. P. Duncan, J. A. Taylor, and J. W. Cotton, 1954: Elementary Statistics. Appleton-Century-Crofts, 29 pp.
White, R. M., and A. M. Galligan, 1956: The comparative accuracy of certain statistical and synoptic forecasting techniques. Bull. Amer. Meteor. Soc., 37, 1-7.
Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.