Exchange of Long-Range Verification Scores

WORLD METEOROLOGICAL ORGANIZATION

COMMISSION FOR BASIC SYSTEMS

MEETING OF EXPERT TEAM TO DEVELOP A VERIFICATION SYSTEM FOR LONG-RANGE FORECASTS

MONTREAL, CANADA, 22-26 APRIL 2002

CBS/ET/DVSLRF/Doc. 3(6)

(5.IV.2002)

ITEM: 3

Original: ENGLISH

EXCHANGE OF LONG-RANGE VERIFICATION SCORES (Real-time Monthly and Seasonal Forecasting at the Met Office)

(Submitted by Richard Graham, Met Office, UK)

Summary and purpose of document

This document:

- Provides a brief overview of the current Met Office real-time monthly and seasonal prediction system, with emphasis on seasonal products;

- Provides an example of how skill information may be attached to long-range forecasts;

- Describes current and future planned developments to the system;

- Provides an update on progress with generating the Standard Verification System scores;

- Shows results of “user oriented” verification using the Heidke score.

Real-time Monthly and Seasonal Forecasting at the Met Office

1. Introduction

A nine member AGCM ensemble, forced with persisted SST anomalies, has been used to produce monthly outlooks for the UK since 1991 and experimental global seasonal outlooks since January 1998. The seasonal forecasts and further information on Met Office activities in long-range forecasting may be viewed at www.metoffice.com/research/seasonal.

The purpose of this paper is to:

- Provide a brief overview of the current forecasting system, with emphasis on seasonal products;

- Provide an example of how skill information may be attached to long-range forecasts;

- Describe current and future developments to the system;

- Provide an update on progress with generating the Standard Verification System scores;

- Show results of “user oriented” verification using the Heidke score.

2. The real-time AGCM monthly and seasonal ensemble system

AGCM and atmospheric initial conditions

Real-time predictions are made using the Met Office’s HadAM3 AGCM (Pope et al., 2000). The resolution of the model is 2.5° latitude by 3.75° longitude, with 19 vertical levels. The model is integrated each week to a range of 6 months in a 9-member ensemble, with atmospheric initial conditions provided by consecutive operational NWP analyses at 6hr intervals; the first member is initialised with the 00Z analysis each Tuesday and the final member with the 00Z analysis on the following Thursday.
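The lagged initialisation described above can be sketched in a few lines of Python. This is only an illustration: the function name `member_start_times` and the example date are hypothetical, and the sketch assumes nothing beyond the 9 members and 6-hour spacing stated in the text.

```python
from datetime import datetime, timedelta

def member_start_times(first_analysis, n_members=9, step_hours=6):
    """Return the analysis time used to initialise each ensemble member."""
    return [first_analysis + timedelta(hours=step_hours * i)
            for i in range(n_members)]

# Hypothetical example week: the first analysis is 00Z on Tuesday 2 April 2002.
times = member_start_times(datetime(2002, 4, 2, 0))
for member, start in enumerate(times, start=1):
    print(member, start.isoformat())
```

Nine analyses at 6-hour spacing span 48 hours, which is why the last member starts from the 00Z analysis two days after the first.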

SST forcing and treatment of sea-ice

Each ensemble integration is provided with the same SST lower boundary forcing, which is updated every 24hrs during the integrations. The SST forcing is derived by persisting the SST anomaly observed during a 4-week period prior to the forecast initial date. The 4-week period lags the forecast initial date by about 10 days, because of a necessary wait for SST observations (the SST and ice-cover analysis used is that of Reynolds and Smith, 1994). The persistence-based SST forecasts are produced by adding the 4-week SST anomalies (SSTA) to climatological averages appropriate for the forecast period (SSTclim). A representation of sea-ice is included in the following way: where there is open sea initially, ice forms if SSTclim + SSTA falls below –1.8°C. Similarly, the impact of anomalous conditions at the end of the cold season is represented such that a warm/cold anomaly hastens/delays the melt-back of sea-ice.
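As a rough illustration of the persistence rule and the ice-formation test for initially open sea, the following Python sketch may help. The function names are hypothetical and the values are invented; only the formula SSTclim + SSTA and the –1.8°C threshold come from the text.

```python
FREEZING_THRESHOLD_C = -1.8  # threshold quoted in the text

def persisted_sst(sst_clim, ssta):
    """Forecast SST: climatology for the forecast period plus the persisted anomaly."""
    return sst_clim + ssta

def ice_forms_over_open_sea(sst_clim, ssta):
    """For a point that is open sea initially: ice forms if forecast SST < -1.8 C."""
    return persisted_sst(sst_clim, ssta) < FREEZING_THRESHOLD_C

# Invented values: a cold anomaly pushes a marginal point below the threshold.
print(ice_forms_over_open_sea(-1.5, -0.5))
print(ice_forms_over_open_sea(-1.5, 0.5))
```

The sketch covers only the open-sea case; the melt-back behaviour at the end of the cold season is not specified in enough detail here to be coded.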

Treatment of land surface variables

Initial conditions for soil moisture, soil temperature and snow cover are taken from climatology. Land surface to atmosphere exchanges are represented using the MOSES scheme (The Met. Office Surface Exchange Scheme) of Cox et al. 1999.

3. Monthly forecast products

Output from the first 31 days of the ensemble integrations is post-processed to produce outlooks for five variables (mean, maximum and minimum temperature, precipitation and sunshine) averaged over two forecast periods (days 11-17 and days 18-31), for 10 regions of the UK and also for parts of Europe (forecasts for the 4-10 day period are provided using output from the ECMWF EPS). Values for all 5 variables are derived both from direct model output and from regression on the forecast pmsl fields for each member, resulting in an effective 18-member ensemble. Forecasts are expressed in categorical format using quintile categories (well above normal, above normal, normal, below normal, well below normal) based on the 1961-90 climate for each UK region. Both deterministic (ensemble-mean) and probability forecasts are produced. The forecasts form the basis of a commercial product, “the Monthly Outlook”, with most customers operating in the energy, water and retail sectors. A “higher confidence” indication is provided with the deterministic forecasts when the ensemble spread is lower than a threshold (details of the calculation of the confidence index may be found in Graham et al., 2001).

4. Global seasonal forecast products

Forecasts are currently provided for anomalies in 3-month-average 850 hPa temperature (used as a proxy for surface temperature) and precipitation at two lead times, months 1-3 (i.e. zero lead) and months 2-4 (1 month lead). Two formats are used, namely:

• Probability that the anomaly will be above or below zero (based on the ensemble distribution).

• Deterministic forecasts of the anomaly sign and magnitude (based on the ensemble mean).

Products are in map format plotted at the model grid-point resolution for the globe and for a number of regional areas.

Forecast 3-month averages are calculated from daily model values, valid at 12Z and expressed as anomalies from the model climatology appropriate to the forecast period. Use of model rather than observed climatology as the reference climatology provides an a posteriori correction for model bias. The appropriate model climatology for the forecast period is calculated from weighted averages of monthly climatologies generated in seasonal simulation experiments for the period 1979-1997.
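A minimal Python sketch of how the two product formats might be derived from one grid point's ensemble is given below. The function names and numerical values are hypothetical; the only elements taken from the text are that anomalies are formed relative to the model climatology, that the probability comes from the ensemble distribution, and that the deterministic value is the ensemble mean.

```python
from statistics import mean

def member_anomalies(member_values, model_clim):
    """Anomaly of each member's 3-month average relative to the model climatology."""
    return [value - model_clim for value in member_values]

def prob_above_zero(anomalies):
    """Fraction of ensemble members with a positive anomaly."""
    return sum(1 for a in anomalies if a > 0) / len(anomalies)

def ensemble_mean_anomaly(anomalies):
    """Deterministic forecast: sign and magnitude of the ensemble-mean anomaly."""
    return mean(anomalies)

# Invented 9-member values for one grid point, with a model climatology of 10.0.
members = [9.0, 9.5, 10.5, 11.0, 11.5, 12.0, 10.5, 9.0, 11.0]
anoms = member_anomalies(members, model_clim=10.0)
print(prob_above_zero(anoms), ensemble_mean_anomaly(anoms))
```

Because the reference is the model's own climatology, any constant model bias cancels in the subtraction, which is the a posteriori bias correction mentioned above.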

Both the probability and deterministic forecasts are provided with optional skill templates, which provide the best available information for masking out regions in which the model is judged to have no skill. The templates are derived from hindcast assessments of the HadAM3 model, as discussed in Section 5.

4.1 Seasonal forecast applications

The real-time seasonal forecasts are made available to the National Meteorological Services of over 130 countries. Additional uses include:

• Advice to UK Government departments, including the Crisis and Humanitarian Affairs Department of the UK Department for International Development.

• Provision to various Regional Climate Outlook Fora.

• Provision of basic seasonal forecast map products to a number of commercial users on a trial basis.

• Site-specific seasonal forecasts are also being developed and trialed for commercial use in the weather derivatives sector.

5. Model validation

Validation of the model is currently being upgraded using hindcast experiments that are run in true retrospective forecast mode. This work has been delayed, for various reasons, but is now expected to be complete in autumn 2002 and will include generation of the SVS core diagnostics ROC and RMSSS. Some preliminary results focussing on user oriented verification are presented in Section 6.

Current validation provided with the real-time forecasts is based on HadAM3 simulations forced with observed SST over the 19-year retrospective period 1979-97. Thus current validation is over-optimistic, though probably reasonable for up to one season ahead (Graham et al., 2000 found that, on average, much of the skill found in the first 3 months with observed SST was retained in runs using persisted SSTA, e.g. average skill scores are depressed by no more than 5% in the tropics). Results from these observed SST runs, and their use in generating skill templates for provision with the forecasts, are summarised below.

5.1 Probabilistic skill

Probabilistic skill for predictions of above or below normal categories has been assessed using the Relative Operating Characteristic (Stanski et al. 1989) with event probability thresholds of 0%, 20%, 40%, 60% and 80%. The area under the ROC curve (referred to as the ROC score) provides a useful overall index of skill; a value of 0.5 (the area under the diagonal defined by hit rate = false alarm rate) or less indicating no skill, and a value of 1 indicating perfect deterministic skill. Using Monte Carlo calculations, in which ROC scores were repeatedly calculated after scrambling the yearly order of the simulations, it was found that the threshold score for 90% significance lies between 0.60 and 0.65, and the threshold for 95% significance between 0.65 and 0.7. Verification was performed against the ECMWF ERA dataset for years 1979-93 and the ECMWF operational analysis for years 1994-1997.
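The ROC score calculation can be illustrated with a short Python sketch. This is not the Met Office code: the function name is hypothetical, the rule that a warning is issued when the forecast probability exceeds the threshold is an assumption (the text gives only the threshold values), and the area is computed with the trapezium rule.

```python
def roc_score(probabilities, occurred, thresholds=(0.0, 0.2, 0.4, 0.6, 0.8)):
    """Area under the ROC curve built from a fixed set of probability thresholds."""
    n_events = sum(1 for o in occurred if o)
    n_non_events = len(occurred) - n_events
    if n_events == 0 or n_non_events == 0:
        raise ValueError("need at least one event and one non-event")

    # Anchor the curve at (0, 0) and (1, 1), then add one point per threshold.
    points = [(0.0, 0.0), (1.0, 1.0)]
    for threshold in thresholds:
        # Assumption: a warning is issued when the probability exceeds the threshold.
        warned = [p > threshold for p in probabilities]
        hits = sum(1 for w, o in zip(warned, occurred) if w and o)
        false_alarms = sum(1 for w, o in zip(warned, occurred) if w and not o)
        points.append((false_alarms / n_non_events, hits / n_events))

    # Points are (false alarm rate, hit rate); integrate with the trapezium rule.
    points.sort()
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Invented forecasts: a constant probability carries no information.
print(roc_score([0.5, 0.5, 0.5, 0.5], [True, True, False, False]))
```

With 19 forecasts per grid point, the significance thresholds quoted above would be estimated by recomputing this score many times after shuffling the yearly order of the forecasts.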

To produce the skill templates, ROC scores are calculated at all model gridpoints and plots of the geographical distribution of the score are generated. An example plot for zero lead forecasts of MAM 850 hPa temperature is provided in Fig. 1 and gives an indication of the spatial variation of skill, though, because of the relatively small sample size, individual values should be treated with caution (for each grid point, 19 probability forecasts are available for each season). ROC scores exceed the 0.5 threshold for skill over many parts of the globe (shading breaks in at 0.6 on Fig. 1 to give an estimate of regions with significant skill). As expected, relatively high ROC scores are most widespread in the tropics, where the ensemble simulations frequently reduce to a single deterministic solution (e.g. consistent indication of the event in all 9 ensemble members) and maximum deterministic skill over the 19-year period is approached or achieved in many areas. Significant skill is also indicated over some extratropical regions, including parts of North America and Europe.

The shaded area on Fig. 1 (ROC score equal to or greater than 0.6) provides a template for regions in which the model has significant skill near the 90% level. Eight such skill templates have been generated for each forecast variable, one for each of the 4 conventional seasons and 2 lead times (months 1-3 and months 2-4) evaluated in the simulation experiments. As the real-time forecasts are issued weekly rather than once per season the template for the conventional season that best matches the valid forecast period is supplied with the forecast.
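Applying a skill template amounts to masking the forecast wherever the ROC score falls below the threshold. The Python sketch below shows the idea on a tiny invented grid; the function name and the use of `None` for masked points are illustrative choices, not part of the operational system.

```python
def apply_skill_template(forecast_grid, roc_grid, threshold=0.6):
    """Keep forecast values where ROC score >= threshold; mask the rest with None."""
    return [
        [value if score >= threshold else None
         for value, score in zip(forecast_row, roc_row)]
        for forecast_row, roc_row in zip(forecast_grid, roc_grid)
    ]

# Invented 2 x 3 grids: forecast probabilities (%) and ROC scores.
forecast = [[70, 55, 40], [30, 65, 80]]
roc = [[0.72, 0.58, 0.60], [0.45, 0.66, 0.59]]
print(apply_skill_template(forecast, roc))
```

The threshold of 0.6 follows the shading level used for Fig. 1; a different score or threshold could be substituted without changing the masking logic.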

An example probability forecast for 850 hPa temperature below/above normal is provided in Fig. 2a with the template applied to show only regions where model skill is available (the corresponding forecast without the template is provided in Fig. 2b). Skill templates for the event precipitation below/above normal (not shown here) have been calculated using ROC scores in the same way.

Our Expert Team has been tasked with making recommendations on how information on forecast skill may be provided to the user. The skill template method described above is one means of achieving this aim.

5.2 Templates for deterministic skill

The geographical distribution of deterministic skill has been investigated by calculating the point correlations of UM ensemble-mean and observed fields over the 19-year simulation timeseries. An example spatial plot of Correlation Coefficients (CCs) for MAM 850 hPa temperature is shown in Fig. 3. The significance of the correlations has been estimated using a Monte-Carlo technique to estimate the probability of achieving equivalent correlations by chance (500 correlations were calculated, each after randomly scrambling the yearly order of the ensemble-mean values). Correlations are plotted only when they are positive and significant at the 90% level or higher. As found for the ROC score assessments, skill is best in the tropics where CC values frequently exceed 0.6, with local peaks in excess of 0.8. Significant correlations are also present in the extratropics, including parts of Europe, though with lower CC values of order 0.4 with peaks to 0.6. The shaded area on Fig. 3 (skill significant at the 90% level or higher) is used to form skill templates for the deterministic forecasts. There is generally good agreement between significant anomaly correlation skill (Fig. 3) and regions of relatively high ROC score (Fig. 1).
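The scrambling test for correlation significance can be sketched as follows for a single grid point. The function names, the fixed random seed and the convention of counting scrambled correlations that equal or exceed the actual one are assumptions made for illustration; the 500 trials and the shuffling of the yearly order come from the text.

```python
import random
from statistics import mean

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def scramble_significance(forecast, observed, n_trials=500, seed=0):
    """Return (correlation, estimated significance level) for one grid point."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    actual = correlation(forecast, observed)
    shuffled = list(forecast)
    n_as_good = 0
    for _ in range(n_trials):
        rng.shuffle(shuffled)  # scramble the yearly order of ensemble means
        if correlation(shuffled, observed) >= actual:
            n_as_good += 1
    return actual, 1.0 - n_as_good / n_trials

# Invented 19-year series in which forecasts match observations exactly.
years = [float(i) for i in range(19)]
cc, significance = scramble_significance(years, years)
print(cc, significance)
```

A grid point would then be shaded on a plot such as Fig. 3 only if the correlation is positive and the estimated significance is 0.9 or higher.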

6. New validation studies

A new set of hindcast studies designed to improve the definition of the model climatology and to upgrade the skill assessment is now nearing completion. Twelve sets of hindcasts are being run over the retrospective period 1982-2000. The integrations are forced with prescribed SSTs from a statistical scheme based on persistence of SST anomalies. Each hindcast set is defined by the calendar month of initialisation, and comprises 19 (one for each year in the period), 6-month-range, 9-member ensemble integrations initialised from the first day of that calendar month in each year. Each hindcast set describes the model background “climate” (which varies with both integration lead time and time of year) appropriate for calibrating real-time forecasts initialised at the start of the corresponding month. Similarly, validation of each set of hindcasts will provide skill information appropriate for use with forecasts initialised at the beginning of the corresponding month. Definition of model climates and predictive skill from 12 (monthly) start points in the seasonal cycle rather than, as at present, interpolating from 4 (seasonal) start points is expected to improve the quality of the forecast and its skill assessment. Analysis of results to date has focussed on generating “user friendly” assessment of skill using the Heidke score.

6.1 HadAM3 skill for prediction of the observed anomaly sign

Initially, a percent correct analysis was tried for a “user friendly” presentation of skill for two category forecasts (i.e. forecasts for above or below the climate normal). A difficulty with this score is that the two categories are not equally likely if the climate mean rather than the climate median is used as the threshold (our experience is that most users prefer the climate normal as reference). This is most noticeable for precipitation in low latitudes, where the climate mean value may frequently be composed of a few years with substantially above normal rainfall and the majority of years falling in the below normal category. The greater population of the below category leads to greater percent correct skill for the below category relative to the above category. This could potentially be misleading to the user, resulting for example in the assumption that the model is particularly good at forecasting “drought conditions”. The Heidke score allows for the different populations of the categories, as described below.

To avoid the problems described above, skill for predictions of the sign of seasonal temperature and precipitation anomalies has been assessed using the Heidke score (Heidke, 1926). Here we present results for zero lead predictions of March-April-May (MAM) anomalies. Assessment for other seasons and with a wider range of measures, including the standard SVS scores, is planned when all integrations have been completed. The application of the Heidke skill score described here is intended to provide a simple guide to skill that may be readily interpreted by non-specialists, and thus be suitable for display on the Met Office public seasonal forecast website.

The Heidke skill score (Heidke, 1926) for predictions of a category is defined as

\[ S_H = \frac{H - C}{T - C} \]

where:

- T is the total number of times the category is predicted to occur;

- H is the number of correct predictions (or “hits”), i.e. the number of times the predicted category is the same as the observed category;

- C is the average number of correct forecasts out of T cases that would be obtained just by chance.

SH is positive if the number of hits is greater than expected by chance, with a maximum score of 1.0 if all forecasts are hits; SH is negative if the number of hits is less than expected by chance, with a lowest possible score of –1.0. SH = 0 if the number of hits is equal to that expected by chance.

In these assessments forecasts are stratified into predictions of above or below the climate normal, and skill assessed for each of the two categories. For this purpose the “forecast” category was selected as the category predicted most likely to occur i.e. the category indicated by 5 or more ensemble members (out of 9). Thus, for example, a forecast of above normal conditions is inferred when 5 or more ensemble members indicate above normal. With this approach the verification will be consistent with the simplest interpretation of the probability forecasts provided on our public website pages.
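The category selection and the Heidke score for one category can be sketched in Python as below. The function names are hypothetical, and one assumption goes beyond the text: the chance expectation C is taken as T multiplied by the observed relative frequency of the category in the sample, which is one common way to allow for unequal category populations.

```python
def forecast_category(member_anomalies):
    """Category indicated by the majority of members (5 or more out of 9)."""
    n_above = sum(1 for a in member_anomalies if a > 0)
    return "above" if 2 * n_above > len(member_anomalies) else "below"

def heidke_score(forecast, observed, category):
    """Heidke score (H - C) / (T - C) for forecasts of a single category."""
    T = sum(1 for f in forecast if f == category)
    H = sum(1 for f, o in zip(forecast, observed)
            if f == category and o == category)
    # Assumption: chance hit rate equals the observed frequency of the category.
    p_observed = sum(1 for o in observed if o == category) / len(observed)
    C = T * p_observed
    if T == C:
        raise ValueError("score undefined: category never forecast or always observed")
    return (H - C) / (T - C)

# Invented 10-year sample with equal numbers of above and below years.
observed = ["above"] * 5 + ["below"] * 5
print(heidke_score(observed, observed, "above"))         # every forecast is a hit
print(heidke_score(["above"] * 10, observed, "above"))   # hits equal chance expectation
```

Stratifying by forecast category, as here, gives a separate score for "above" and "below" forecasts, matching the paired maps in Figures 4 and 5.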

Example global plots of Heidke score are shown in Figures 4a&b for zero-lead MAM forecasts of above and below normal surface temperature, respectively. The sign of the predicted surface temperature anomaly is inferred (with a one-to-one mapping) from the sign of the predicted 850hPa temperature anomaly. Verification is against the New (2000) dataset. The general distribution of prediction skill is similar for both above and below normal categories, though in the higher skill regions peaks in skill (e.g. SH>0.6) appear somewhat more widespread for the above average category. Although the Heidke score is designed to correct for skill biases arising from unequal observed frequencies of the above and below categories, the sample studied here is too small to be certain that the small differences in score seen between Figures 4a&b represent a true difference in model performance. For both categories peak scores (SH>0.6) are found over tropical parts of South America, Africa, the Arabian peninsula, Indonesia and Australia. Outside the tropics best skill is found over parts of North America and parts of north-eastern Asia. Over Europe skill is positive and of order SH=0.2 with a peak exceeding 0.4 near northern Iberia.

The relationship between the Heidke score and the percent correct score may be loosely recovered (to help the user) for the two category case if the assumption is made that above and below categories are equally likely (a reasonable approximation for many extratropical regions). In this case

\[ \frac{H}{T} = \frac{S_H + 1}{2} \]

Thus the SH values of between 0.2 and 0.4 over Europe indicate percent correct scores of between 60% and 70%.
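The conversion to percent correct follows directly from the definition of the Heidke score once the two categories are assumed equally likely, so that the chance expectation is C = T/2. A short worked derivation, under that assumption only:

```latex
\begin{align*}
C &= \tfrac{T}{2}, \\
S_H &= \frac{H - T/2}{T - T/2} = \frac{2H}{T} - 1, \\
\frac{H}{T} &= \frac{S_H + 1}{2}.
\end{align*}
% S_H = 0.2 gives H/T = 0.6 (60% correct); S_H = 0.4 gives H/T = 0.7 (70% correct).
```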

Corresponding plots of Heidke skill score for MAM precipitation are provided in Figures 5a&b. As for the temperature results the geographical distribution of skill is similar for both above (Figure 5a) and below (Figure 5b) categories. Best skill is again found in tropical regions with peak scores exceeding 0.6 in north-east Brazil (above average category), parts of west Africa (below average category), Indonesia and northern and eastern Australia. In the northern extratropics peaks in skill are present over western North America, eastern Europe, the eastern Mediterranean and parts of central Asia (centred near Afghanistan). Over western Europe highest skill is found near 60°N, where scores of order 0.2-0.4 are found over northern Scotland and Scandinavia. Elsewhere in western Europe scores are generally less than 0.2.

6.2 Verification work as part of the EU project DEMETER

Retrospective seasonal forecasts, for the 30-year period 1969-1998, are being performed with both the Met Office coupled ocean-atmosphere seasonal prediction model (GloSea) and the HadAM3 uncoupled AGCM as part of the EU project DEMETER (Development of a European Multi-model Ensemble System for Seasonal to Interannual Prediction). The broad aim of DEMETER is to assess the skill of a potential European multi-model ensemble comprising coupled ocean-atmosphere models from six European centres, and to explore potential applications of such a forecast system in the agricultural and health sectors. As part of the DEMETER project a central system for verification of all six European models has been built by ECMWF in coordination with the Met Office and other DEMETER partners. The verification diagnostics used incorporate the core SVS diagnostics of ROC and RMSSS.

7. Future plans

The main development on the horizon is the implementation of the Met Office’s coupled ocean/atmosphere model for global seasonal prediction (GloSea). The GloSea coupled prediction model is currently being extensively evaluated as part of EU project DEMETER and is also being run routinely in real-time mode. Real-time forecasts are planned to run in tandem with the ECMWF System2 prediction model to evaluate the benefits of a multi-model ensemble. Following skill comparisons with the HadAM3 AGCM system, GloSea is expected to replace the AGCM as the operational system in late 2002. A key future activity in forecast applications will be further development and evaluation of site-specific seasonal forecasts.

Summary

- Validation results confirm that the HadAM3 AGCM forced with persisted SSTA has predictive skill over a wide area of the globe. Best skill is found in tropical regions but positive skill is also available in extratropical regions, including Europe, in some seasons.

- Skill masking using threshold values on the ROC (or other) scores, provides an effective means of conveying skill information to the forecast user.

- The Heidke score may be useful for providing “user oriented” evaluation, affording something of a compromise between percent correct and more rigorous scores.

- The Met Office is expected to complete generation of the SVS standard scores evaluated over a 19 year period (1982-2000) in autumn 2002.

- Verification over a longer hindcast set (30 years, 1969-1998) is being performed as part of the DEMETER project, which completes in spring 2003.

References:

Cox, P.M., R.A. Betts, C.B. Bunton, R.L.H. Essery, P.R. Rowntree and J. Smith, 1999: The impact of new land surface physics on the GCM simulation of climate and climate sensitivity. Climate Dynamics, 15, 183-203.

Graham, R.J., Evans, A.D.L., Mylne, K.R., Harrison, M.S.J. and Robertson, K.B., 2000: An assessment of seasonal predictability using atmospheric general circulation models. Q.J.R. Meteorol. Soc., 126, 2211-2240.

Graham, R.J., Davey, M.K., Becker, B., McLean, P.J., Naylor, M.R., Colman, A.W., Ineson, S., Huddleston, M. and Barnes, R., 2001: Monthly and Seasonal Forecasting and Applications at the Met Office. Submitted to proceedings of the ECMWF Seasonal Forecast Users Meeting, 2001.

Heidke, P., 1926: Berechnung des Erfolges und der Güte der Windstärkevorhersagen im Sturmwarnungsdienst. Geografiska Annaler, 8, 301-349.

New, M., Hulme, M. and Jones, P., 2000: Representing Twentieth-Century Space-Time Climate Variability. Part II: Development of 1901-96 Monthly Grids of Terrestrial Surface Climate. Journal of Climate, 13(13), 2217-2238.

Pope, V.D., Gallani, M.L., Rowntree, P.R. and Stratton, R.A., 2000: The impact of new physical parameterizations in the Hadley Centre climate model: HadAM3. Climate Dynamics, 16(2-3), 123-146.

Reynolds, R.W. and T.M. Smith, 1994: Improved Global Sea Surface Temperature Analyses Using Optimum Interpolation. J. Climate, 7(6), 929-948.

Stanski, H.R., L.J. Wilson and W.R. Burrows, 1989: Survey of common verification methods in Meteorology. World Weather Watch Technical Report No. 8, WMO/TD 358, 114pp.

Fig. 1: Geographical distribution of ROC score obtained at each model grid-point for zero-lead forecasts of the event MAM 850hPa temperature below (or above) normal. Shading begins at 0.6; shading interval is 0.1.

Fig. 2a: Example probability forecast with skill template (grey shading). The forecast depicted is the probability (%) of a positive anomaly in mean 850 hPa temperature in the month 1-3 period (valid: 3rd September to 2nd December 1999). Redder colours indicate higher probability of a warm anomaly; bluer colours indicate higher probability of a cold anomaly. The skill template used is that for SON.

Fig. 2b: As Fig. 2a, but without the skill template.

Fig. 3: Geographical distribution of correlation coefficients between zero-lead ensemble mean forecasts of 850 hPa temperature anomaly and corresponding observed (ERA/ECMWF) anomalies at each model grid point over the 19 year simulation sample. Shading interval is 0.2; values are shown only where correlations are positive and significant at the 90% level or higher.

Figure 4a&b: Heidke skill scores for zero-lead predictions of above average (a) and below average (b) MAM temperature, assessed against the UEA T2m dataset. Positive scores are shaded at intervals of 0.2; all regions with zero or negative scores are shown as blue.

Figure 5a&b: As Figure 4, but for precipitation (assessed against GPCP analyses).
