DECEMBER 2013 HUDSON ET AL. 4429

Improving Intraseasonal Prediction with a New Ensemble Generation Strategy

DEBRA HUDSON,ANDREW G. MARSHALL,YONGHONG YIN,OSCAR ALVES, AND HARRY H. HENDON Centre for Australian Weather and Climate Research,* Melbourne, Victoria, Australia

(Manuscript received 14 February 2013, in final form 11 July 2013)

ABSTRACT

The Australian Bureau of Meteorology has recently enhanced its capability to make coupled model forecasts of intraseasonal climate variations. The Predictive Ocean Atmosphere Model for Australia (POAMA, version 2) seasonal prediction forecast system in operations prior to March 2013, designated P2-S, was not designed for intraseasonal forecasting and has deficiencies in this regard. Most notably, the forecasts were only initialized on the 1st and 15th of each month, and the growth of the ensemble spread in the first 30 days of the forecasts was too slow to be useful on intraseasonal time scales. These deficiencies have been addressed in a system upgrade by initializing more often and through enhancements to the ensemble gen- eration. The new ensemble generation scheme is based on a coupled-breeding approach and produces an ensemble of perturbed atmosphere and ocean states for initializing the forecasts. This scheme impacts fa- vorably on the forecast skill of Australian rainfall and temperature compared to P2-S and its predecessor (version 1.5). In POAMA-1.5 the ensemble was produced using time-lagged atmospheric initial conditions but with unperturbed ocean initial conditions. P2-S used an ensemble of perturbed ocean initial conditions but only a single atmospheric initial condition. The improvement in forecast performance using the coupled- breeding approach is primarily reflected in improved reliability in the first month of the forecasts, but there is also higher skill in predicting important drivers of intraseasonal climate variability, namely the Madden– Julian oscillation and southern annular mode. The results illustrate the importance of having an optimal ensemble generation strategy.

1. Introduction usually does not appear above the noise until after 1-month lead time. However, over the past few years, Dynamical model prediction of intraseasonal climate with improvements to numerical prediction models, is a developing field worldwide, exhibiting a growing ensemble prediction techniques, and initialization, there research and operational focus (e.g., Toth et al. 2007; have been good indications of skill at regional levels Gottschalck et al. 2008; Vitart et al. 2008; Brunet et al. for forecasts in this prediction gap of beyond 1 week but 2010; Rashid et al. 2010; Hudson et al. 2011a,c; Marshall shorter than a season (e.g., Vitart et al. 2008; Hudson et al. 2011a,b,c; Vitart and Molteni 2011). Traditionally, et al. 2011a). Forecast models now also have improved it has been a difficult time scale for which to provide representation and prediction of slower internal pro- skillful forecasts, since beyond a week the system has cesses such as the tropical Madden–Julian oscillation lost much of the information from the atmospheric ini- (MJO) (e.g., Rashid et al. 2010; Marshall et al. 2011a; tial conditions (upon which weather forecasts rely), Vitart and Molteni 2011) that can contribute to pre- and the variability forced by slow variations of bound- dictability at lead times longer than those of midlatitude ary conditions (e.g., tropical sea surface temperatures) weather events but shorter than a season. This potential for intraseasonal prediction has recently been recog- nized by the World Meteorological Organization through * The Centre for Australian Weather and Climate Research is a partnership between the Australian Bureau of Meteorology and the implementation of a new subseasonal prediction the Commonwealth Scientific and Industrial Research Organisation. project (http://www.wmo.int/pages/prog/arep/wwrp/new/ S2S_project_main_page.html). The research project aims to improve our forecast skill and our understanding Corresponding author address: Dr. Debra Hudson, Centre for Australian Weather and Climate Research, Bureau of Meteorol- of intraseasonal climate variability, as well as foster- ogy, GPO Box 1289, Melbourne, VIC 3001, Australia. ing the improvement and rollout of operational in- E-mail: [email protected] traseasonal forecasting systems and uptake by the

DOI: 10.1175/MWR-D-13-00059.1

Ó 2013 American Meteorological Society Unauthenticated | Downloaded 10/05/21 05:21 PM UTC 4430 MONTHLY WEATHER REVIEW VOLUME 141 applications community. It is a joint project between the paper that although the latter approach might be ap- weather (the World Weather Research Program) and propriate for seasonal prediction, it is inadequate for climate (the World Climate Research Program) re- intraseasonal prediction, since by not perturbing the search communities. This effort underscores the chal- atmosphere at the initial time, there is insufficient spread lenges of addressing this time scale, which is influenced in the first 30 days of the forecasts. As a result, a novel by short-lived atmospheric processes as well as longer coupled ensemble initialization scheme has been de- variations in, for example, the ocean, sea ice, and land. veloped for initializing the intraseasonal forecasts. This The Australian Bureau of Meteorology (BoM) is cur- scheme is based on a breeding cycle performed with the rently engaged in developing the science and systems that coupled model and generates perturbed atmosphere and could form an intraseasonal forecast service for Australia. ocean states about the central control member that is This has been driven by strong stakeholder interest, par- provided by the ocean and atmosphere analyses. A ticularly from the agricultural community (http://www. breeding approach has been used before within a cou- managingclimate.gov.au/publications/climag-18/; http://www. pled ocean–atmosphere system, although the primary grdc.com.au/Media-Centre/Ground-Cover-Supplements/ focus was to isolate ENSO coupled instabilities in the Ground-Cover-Issue-87-Climate-Supplement). This in- initial ensemble perturbations for seasonal to inter- traseasonal forecasting capability is based on BoM’s annual time-scale forecasts (e.g., Toth and Kalnay 1996; Predictive Ocean Atmosphere Model for Australia Cai et al. 2003; Yang et al. 2006). (POAMA) coupled model seasonal prediction system. The aims of this paper are to document the intra- Hudson et al. (2011a,c), Marshall et al. (2011a,c), and seasonal skill of the new intraseasonal version of the Rashid et al. (2010) assessed the skill of the POAMA-1.5 POAMA-2 system, which we refer to as the multiweek system (hereafter P1.5) for making intraseasonal pre- version (P2-M), and to compare it to the skill of the dictions of Australian climate and its ability to represent seasonal version of POAMA-2 (P2-S) and the older P1.5 and predict key drivers of intraseasonal climate vari- system. The focus will be primarily on forecast perfor- ability. Even though the P1.5 system was not designed mance for the Australian region but impacts of the new for intraseasonal prediction (e.g., in real time only a single POAMA ensemble generation strategy for prediction of forecast was made once per day, thereby necessitating the the MJO and the SAM will also be documented. The use of a lagged ensemble with lags of N days for N characteristics (e.g., structure and growth rates) of the members), there was promising skill for forecasts of the perturbations generated by the coupled-breeding tech- impending and subsequent fortnight (i.e., the mean of nique will be addressed in a separate paper. forecast days 1–14 and days 15–28, respectively) for rainfall and maximum temperature, particularly over eastern and southeastern Australia. Interestingly, fore- 2. POAMA-2 intraseasonal forecast system cast skill for fortnight 2 was found to be increased during a. Model and data assimilation details extremes of the El Nino–Southern~ Oscillation (ENSO), the Indian Ocean dipole (IOD), and the southern annular Dynamical intraseasonal–seasonal prediction at the mode (SAM), indicating that these slow variations of Bureau of Meteorology is based on POAMA-2, which is boundary forcing should be considered a source of in- a coupled ocean–atmosphere and data traseasonal climate predictability. Nonetheless, de- assimilation system. Prior to March 2013, the POAMA-2 ficiencies in the P1.5 system, which was designed for system was configured in two ways: one for multiweek seasonal climate prediction, were clearly acting to limit lead times for intraseasonal predictions (P2-M, an ex- the capability of making intraseasonal forecasts. perimental system) and one for seasonal predictions (P2-S, In the upgrade of POAMA to version 2, the most sig- an operational system). The model components are iden- nificant changes included a more advanced ocean data tical and the two systems differed only in the manner of assimilation scheme (Yin et al. 2011) and a multimodel ensemble initialization and in forecast frequency and formulation together with an increased ensemble size. length. P2-S was run less often but forecasts were longer Implicit in the new ocean data assimilation scheme is in order to target seasonal climate. P2-M was run more the generation of an ensemble of ocean states for ini- often and for shorter forecasts in order to target intra- tializing the forecasts. Although the POAMA-2 system seasonal variability. System details of P2-M, P2-S, and uses a ‘‘burst’’ ensemble generation approach (i.e., all P1.5 are summarized in Table 1. We note that in March ensemble members initialized at the same time), thereby 2013 the P2-M system replaced P2-S as the BoM oper- rectifying the need for a lagged ensemble to generate ational system for intraseasonal to seasonal prediction. intraseasonal forecasts, each ensemble member uses the The main modules of the coupled models in P2-M and same atmospheric initial conditions. We will show in this P2-S are identical and similar to those of the older P1.5.

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC DECEMBER 2013 HUDSON ET AL. 4431

TABLE 1. Configurations of the P2-M, P2-S, and P1.5 systems.

POAMA-2 (P2-M) POAMA-2 (P2-S) POAMA-1.5 (P1.5) Atmosphere–land data ALI (Hudson ALI (Hudson et al. 2011b) ALI (Hudson et al. 2011b) assimilation et al. 2011b) Ocean data assimilation PEODAS (Yin PEODAS (Yin et al. 2011) Univariate optimum et al. 2011) interpolation (Smith et al. 1991). Ensemble generation 33 members 30 members 10 members Multimodel (P2a, P2b, P2c) Multimodel (P2a, P2b, P2c) Single model (identical to P2c) Burst ensemble: ocean Burst ensemble: ocean Lagged ensemble: lagged and atmosphere perturbations from atmosphere initial perturbations from PEODAS; no atmosphere conditions; no ocean the CEI scheme perturbations perturbations

The component is the BoM’s At- atmosphere model of POAMA (run prior to hindcasts mospheric Model (BAM version 3.0; Colman et al. 2005; or forecasts being made and forced with observed SST) Wang et al. 2005; Zhong et al. 2006), which has a T47 toward an observationally based analysis. ALI nudges horizontal resolution (approximately a 250-km grid) and to the analyses from the 40-yr European Centre for 17 levels in the vertical. This horizontal resolution, to- Medium-Range Weather Forecasts (ECMWF) Re- gether with the grid configuration, means that Tasmania Analysis (ERA-40; Uppala et al. 2005) for the period is not resolved as land in POAMA. The land surface 1980–August 2002, and to the BoM’s operational global component in POAMA is a simple bucket model for soil numerical weather prediction (NWP) analysis thereafter. moisture (Manabe and Holloway 1975) and has three The scheme also generates land surface initial conditions soil levels for temperature. The ocean model is the that are in balance with the atmospheric condition. Australian Community Ocean Model version 2 (ACOM2; The ocean analysis system for P1.5 was based on Schiller et al. 1997; Schiller et al. 2002; Oke et al. 2005) a univariate optimum interpolation system that assimi- and is based on the Geophysical Fluid Dynamics Lab- lates only in situ temperature observations (Smith et al. oratory Modular Ocean Model (MOM version 2). The 1991). Due to the lack of appropriate multivariate for- zonal resolution of the ocean grid is 28, and the meridi- mulations, this approach has a detrimental effect on the onal spacing is 0.58 within 88 of the equator, increasing salinity and velocity fields of the model. To rectify these gradually to 1.58 near the poles. ACOM2 has 25 vertical limitations, a new ocean analysis system, the POAMA levels, with 12 levels in the top 185 m and a maximum Ensemble Ocean Data Assimilation System (PEODAS; depth of 5 km. The atmosphere and ocean models are Yin et al. 2011), has been developed. PEODAS is an coupled using the Ocean Atmosphere Sea Ice Soil approximate form of the ensemble Kalman filter and (OASIS) coupling software (Valcke et al. 2000). generates an ensemble of ocean states each day, including In addition to configuring POAMA-2 for intra- a central unperturbed ocean analysis. PEODAS is based seasonal prediction (described in section 2b), P2-M and on the multivariate ensemble optimum interpolation P2-S have major upgrades over P1.5 in the following system of Oke et al. (2005), but uses covariances derived aspects: from the time-evolving model ensemble. In situ tem- perature and salinity observations are assimilated, and d an advanced ocean data assimilation scheme (Yin corrections to currents are generated based on the en- et al. 2011), semble cross covariances with temperature and salinity. d a multimodel ensemble (MME) approach (Wang et al. PEODAS has demonstrated a significant quantitative 2011), and improvement in the depiction of the initial state of the d a larger ensemble over a longer hindcast period. upper ocean, especially in relation to salinity (Yin et al. Forecasts from all versions of POAMA are initialized 2011; Zhao et al. 2013b). from observed atmospheric and oceanic states. The at- To address model uncertainty, POAMA-2 (both P2-S mosphere and land initial conditions are provided by the and P2-M) has adopted a pseudo-MME strategy using Atmosphere–Land Initialization (ALI) scheme for both three different model configurations: P1.5 and P2 (Hudson et al. 2011b). ALI creates a set of atmosphere and land initial states by nudging zonal and d P2a—modified atmospheric physics configured to use an meridional winds, temperatures, and humidity from the alternative shallow convection physical parameterization

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC 4432 MONTHLY WEATHER REVIEW VOLUME 141

scheme that results in less climate drift compared to P1.5, d P2b—bias-corrected version of P2a (i.e., same model version as P2a but with ocean–atmosphere fluxes corrected to reduce climatological biases), and d P2c—standard physics options that are identical to those in P1.5. b. Ensemble generation The 10-member ensemble of forecasts for P1.5 was generated using lagged atmospheric initial conditions from ALI by successively initializing each member with the atmospheric analysis from 6 h earlier (i.e., the 10th member was initialized 2.25 days earlier than the first member). Each ensemble member used the identical ocean initial condition. In contrast, POAMA-2 uses a burst ensemble (i.e., all initial conditions are valid for the same date and time; there are no lagged initial conditions). The 30-member P2-S ensemble comprises a 10-member ensemble for each model version (P2a, P2b, P2c) that is generated using 10 ocean initial conditions provided directly by PEODAS (each model version uses the same set of 10 ocean initial conditions). However, in contrast to P1.5, all ensemble members use the identical atmospheric initial condition. Not perturbing the atmosphere is ac- ceptable for seasonal predictions, but for intraseasonal prediction there is not enough spread generated by the model in the first month of the forecast. This can be seen in Fig. 1, which shows that the ensemble spread of 500-hPa heights from P2-S at day 10 of the forecast is an order of magnitude smaller than that obtained from P1.5 (Figs. 1a and 1b). However, the spread at day 10 from P2-M (Fig. 1c) is comparable (if not slightly larger) to that obtained from P1.5. Details of the en- semble generation strategy for P2-M are described below. The P2-M forecasts are initialized using a coupled- FIG. 1. Ensemble spread of 500-hPa heights (m) at day 10 of breeding scheme, which we refer to as the Coupled forecasts initialized on 1 Jul 1980–2006 for (a) P1.5, (b) P2c-S, and Ensemble Initialization (CEI) scheme. A schematic of (c) P2c-M. All three systems use the same model (see text for de- the method is shown in Fig. 2. The breeding scheme tails) and results are displayed for 10-member ensembles. Spread produces 10 perturbed states about the central analyses greater than 10 m is depicted by solid contours with an interval of provided by the PEODAS central analysis for the ocean 10 m. Spread less than 10 m is depicted by dashed contours with an interval of 2 m. The latter was done to emphasize that all spread in and the ALI analysis for the atmosphere–land. During (b) is less than 10 m. each breeding cycle the following steps are taken: 1) The 11-member ensemble (using the P2a coupled PEODAS for the ocean and ALI for the atmosphere model), including one central unperturbed control and land). and 10 perturbed members, is integrated forward 2) At the end of the 24-h integration (t 1 24), the bred 24 h from an initial time t. The purpose of the central perturbations (atmospheric u, y, T, q and oceanic coupled model run is to generate the 24-h forecast T, S, u, y) are determined as the difference between for calculation of the perturbations and this control each perturbed ensemble forecast and the central is initialized at time t using the central analysis (i.e., control forecast.

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC DECEMBER 2013 HUDSON ET AL. 4433

Considering the different characteristics of the time scales of variability in the ocean and atmosphere and their strong spatial dependence, different rescaling norms that are functions of space are used for each component. Zonally averaged root-mean-square difference (RMSD) of 10-m zonal wind (u10; only a function of latitude) and three-dimensional (3D) RMSD of ocean temperatures have been selected as the atmospheric and oceanic rescaling norms, respectively. The rescaling factor for each ensemble member is defined as the ratio of the norm calculated from the estimated analysis error vari- ance and the norm of the forecast perturbations at each

FIG. 2. The CEI system. grid point. If the variance of the ensemble perturbations about the central member does not exceed the estimated analysis error variance, then no rescaling for that day is 3) The forecast perturbations of each ensemble mem- applied. The analysis error of u10 is estimated by the ber are then scaled with respect to the estimated RMSD of u10 between ALI and ERA-40 (Uppala et al. analysis error (discussed below) and centered by 2005). The 3D analysis error of oceanic temperature is subtracting the mean of the rescaled perturbations estimated by the ensemble spread from PEODAS. from all ensemble members. The expectation is that The CEI scheme is closely aligned with the long-term this limited number of perturbations will have an ini- goal of fully coupled data assimilation with POAMA. tial spread distribution that is similar to the analysis The perturbations are generated using the coupled ocean– error variance. atmosphere model and therefore the resulting atmo- 4) The ensemble members are then updated by adding sphere and ocean initial perturbations are consistent. the perturbations to the central analysis (PEODAS These perturbations also display significant state de- for the ocean and ALI for the atmosphere–land) at pendence for both atmosphere and ocean components. time t 1 24. These 10 coupled updated states, together The 33-member P2-M ensemble comprises an with the central analysis, are used for initializing the 11-member ensemble for each model version (P2a, P2b, hindcasts and forecasts. P2c). Each model version uses the same set of 11 initial 5) Repeat steps 1–4 above. conditions (i.e., the 10 perturbed and 1 central analysis provided directly by the CEI scheme). All ensembles are cycled every 24 h for the entire c. Hindcast and real-time configuration 1980–2010 period. Prior to this period, there is an initial 2-week spinup, started from ocean perturbations from Hindcasts from P2-M were generated 3 times per PEODAS. In this spinup there is no rescaling, so as to month (on the 1st, 11th, and 21st) for the period 1980– allow fast weather instabilities to saturate. CEI uses 2010 and the real-time system is initialized every a one-sided breeding method (i.e., rather than positive– Thursday at 0000 UTC. P2-S hindcasts were made just negative paired or two sided), also referred to as once per month (on the 1st) and the real-time seasonal subtract-mean breeding (Toth and Kalnay 1997; Wang forecasts were made on the 1st and 15th of each month. et al. 2004). Wei et al. (2008) found that experiments Although the P2-S system can be used for intraseasonal using one-sided breeding have performed significantly prediction, the new P2-M system is more ideally suited better than the paired perturbations with performance because it is initialized once per week (or once every almost as high as that found with the ensemble transform 10 days for the hindcasts) and the atmospheric initial technique. The better performance with one-sided breed- conditions are perturbed. To demonstrate the improved ing is related to the ensemble centering strategy. By capability of the P2-M system, this paper compares simply removing the mean from all perturbations, intraseasonal hindcasts from P2-M to intraseasonal hind- the independence of these perturbations is preserved, casts from P2-S and P1.5 over the common period 1980– whereas the paired centering scheme reduces the effec- 2006, for hindcasts initialized on the first of the month. tive number of degrees of freedom of the ensemble sub- d. Verification methodology space by half and may result in worse probabilistic scores. The one-sided breeding method in CEI is similar to the Forecast skill is assessed using anomalies from the centering strategy used in the PEODAS ensemble setup hindcast climatology. These anomalies are created (Yin et al. 2011). by producing a lead-time-dependent ensemble mean

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC 4434 MONTHLY WEATHER REVIEW VOLUME 141 climatology from the hindcasts. This climatology is climatology. Positive (negative) BSS values indicate a function of both start date and lead time, and thus forecasts that are better (worse) than climatology. In a first-order linear correction for model mean bias is this study we use a formulation of the BSS referred to as made (e.g., Stockdale 1997). the debiased BSS (Muller€ et al. 2005; Weigel et al. 2007; The focus of this paper is on the skill of the forecast for Mason and Stephenson 2008). Further details of these the first and second fortnight (average of days 1–14 and verification methods are provided in Hudson et al. 15–28, respectively) for precipitation and temperature (2011a). over Australia. The BoM National Climate Centre gridded daily analyses (averaged into fortnights) of 3. Hindcast verification precipitation and maximum (Tmax) and minimum (Tmin) temperature are used for the verification. These gridded This section compares the complete ensemble of analyses are produced from quality-controlled station P2-M (33-member ensemble) with P2-S (30-member data by the application of a three-pass Barnes successive- ensemble) and P1.5 (10-member ensemble) in order to correction analysis (Mills et al. 1997). They are avail- highlight the overall benefit of the new P2-M system. able on a 0.258 grid and are averaged onto the POAMA a. Geopotential height over the Southern Hemisphere T47 atmospheric grid (roughly 250 km). Observed anom- alies are created relative to the observed climatology Characteristics of ensemble spread (standard devia- (using the 1980–2006 period) for direct comparison tion of the ensemble members about the ensemble with the POAMA anomalies. The forecast quality for mean) and ensemble mean RMSE with lead time for 500-hPa geopotential height anomalies across the ex- daily 500-hPa geopotential height anomalies over the tratropical Southern Hemisphere are also assessed, which Southern Hemisphere extratropics (208–608S) for the provides some insight into the capability of predicting three systems are shown in Figs. 3a–c. The scores are all the large-scale synoptic variability of the midlatitude normalized by the observed standard deviation. For a circulation. The National Centers for Environmental climatological forecast, the normalized RMSE (NRMSE) Prediction–Department of Energy (NCEP–DOE) At- would be one, and thus forecasts are typically deemed mospheric Model Intercomparison Project (AMIP-II) skillful for NRMSE , 1 (i.e., forecasts are skillful when Reanalysis (R-2) (Kanamitsu et al. 2002) of daily 500-hPa the error is smaller than that from forecasting clima- heights are used for verification. tology). P2-M (Fig. 3c) skillfully predicts daily geo- Verification of the ensemble mean is shown in the potential height anomalies out to about 12 days, compared form of correlation, which measures the linear corre- to 8 days for P1.5 (Fig. 3a) and about 6 days for P2-S spondence between the ensemble mean forecast and (Fig. 3b). P2-S (Fig. 3b) starts off with a similar magni- observed anomalies, and the root-mean-square error tude NRMSE to P2-M, but the error in P2-S grows (RMSE) of the anomalies, which provides information rapidly and instead of asymptotically approaching a cli- on errors in forecast amplitude. For probabilistic veri- matological forecast (NRMSE 5 1), it overshoots this fication, focus is placed on the forecast probability of threshold. This is due to the lack of ensemble spread in exceeding tercile thresholds. Calculation of the terciles P2-S (Fig. 3b) such that the ensemble mean still has from both the model and observations are subject to significant amplitude (less canceling out than if there leave-one-out cross validation. The verification metrics was more spread). P2-M also has a smaller error than for the probabilistic forecasts are based on contingency P1.5 in the first 20 days of the forecast and a larger en- tables of the number of observed occurrences and non- semble spread. The ensemble spread in P1.5 grows more occurrences of the event (e.g., temperature in the upper slowly compared to P2-M, indicating that lagged atmo- tercile) falling into predefined forecast probability bins. spheric initial conditions are not an optimal way to gen- In this study, five probability bins are defined: 0–0.2, 0.2– erate forecast spread. In a perfect ensemble system, over 0.4, 0.4–0.6, 0.6–0.8, and 0.8–1.0. For calculation of the a large sample of forecasts, the ensemble spread will be tercile thresholds, anomaly data from all of the ensem- equal to the RMSE of the ensemble mean for all lead ble members are used. Relative operating characteristic times (Palmer et al. 2006; Weisheimer et al. 2011). The (ROC) scores and curves, attributes diagrams, and Brier P2-M system has the best relationship between error and score metrics are used (e.g., Mason and Graham 1999; spread among the three systems. The other two systems Mason and Graham 2002; Toth et al. 2003; Wilks 2006; are clearly underdispersive (the ensemble spread is Mason and Stephenson 2008). The relative quality of smaller than the error). the Brier score is measured with the Brier skill score Figure 3d shows the average spatial (pattern) corre- (BSS), which is defined as the improvement of the prob- lation coefficient as a function of lead time for 500-hPa abilistic forecast relative to a reference forecast, usually height anomalies over the Southern Hemisphere. There

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC DECEMBER 2013 HUDSON ET AL. 4435

FIG. 3. Normalized RMSE (solid) and ensemble spread (dashed) for (a) P1.5 (green), (b) P2-S (blue, 30-member ensemble) and P2c-S (gray, 10-member ensemble), and (c) P2-M (red, 33-member ensemble) and P2c-M (gray, 10-member ensemble) for 500-hPa geopotential height anomalies over the Southern Hemisphere (208–608S) for forecasts initialized on the first of June, July, and August over 1980–2006. Normalized RMSE and spread are calculated for each model grid box and then area averaged. (d) Corresponding average spatial cor- relation with observed for P1.5 (green), P2-S (blue), and P2-M (red). The verification data are from R-2 (Kanamitsu et al. 2002). is not much difference between the three systems over forecast systems are shown in Fig. 4. These curves and the first 5 days of the forecast, but as the lead time in- diagrams are based on contingency tables using all land creases, P2-M emerges as having a slightly higher cor- grid boxes over Australia for all forecast start months. relation so that around lead times of 10 days, P2-M is ROC curves provide information on forecast resolution about 2 days better than P2-S and 1 day better than P1.5 by measuring the ability of a forecast to discriminate (although the differences between these correlations are between the occurrence and nonoccurrence of an event, not statistically significant at a 10% level, calculated in this case, rainfall falling in the upper tercile. The no- using Fisher’s Z transformation and n 5 81). skill line on a ROC curve is the diagonal, where hit rates equal false alarm rates. A forecast system with positive b. Precipitation and temperature forecasts over skill has a curve that lies above the diagonal and bends Australia toward the top-left corner (0.0, 1.0), such that hit rates ROC curves and attribute diagrams for forecasts of exceed false alarm rates. The ROC curves exhibit a de- precipitation across Australia in the upper tercile for the cline in skill from the first to the second fortnight in all first and second fortnights of the forecast for all three three systems, although skill still prevails in the second

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC 4436 MONTHLY WEATHER REVIEW VOLUME 141

FIG. 4. (a),(b) ROC and (c),(d) attributes diagrams of the probability that 2-week mean precipitation is in the upper tercile for P1.5 (green), P2-S (blue), and P2-M (red) for all forecast start months and model grid boxes over Australia. (left) The first fortnight (days 1–14) and (right) the second fortnight (days 15–28). The diagonal line in the ROC diagrams in (a),(b) represents the no-skill line. In the attributes diagrams [(c),(d)], the inset histogram represents the frequency of forecasts in each forecast probability bin (bins are 0.2 wide). The solid diagonal line represents perfect reliability and points falling in the shaded region contribute positively to forecast skill. fortnight (Figs. 4a and 4b). The ROC curve for P2-M is the shaded region are forecasts that contribute posi- better than the other two systems in the first fortnight, tively to forecast skill (as defined by the Brier skill score, but there is not much difference between the three thus incorporating a combined effect of reliability and systems in the second fortnight. resolution) (Wilks 2006). Clearly, P2-M has greatly in- The main difference between the forecast systems is creased reliability compared to P2-S and P1.5 (Figs. 4c evident in the attribute diagrams. An attribute diagram and 4d). The distinction in reliability between the sys- shows the conditional relative frequency of occurrence tems is less dramatic in the second fortnight, but in con- of an event (observed relative frequency) as a function trast to the other two systems, the forecasts from P2-M of the forecast probability, thereby providing infor- still contribute positively to the overall skill (as indicated mation on forecast reliability. The diagonal line on the by being in the shaded area of the diagram; Fig. 4d). All diagram indicates perfect reliability and points falling in the forecast systems are overconfident (points lie below

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC DECEMBER 2013 HUDSON ET AL. 4437

FIG. 5. ROC area of the probability that precipitation averaged over the second fortnight (days 15–28) is in the upper tercile for each of the four seasons for (top) P1.5, (middle) P2-S, and (bottom) P2-M. ROC areas significant at the 5% significance level are shaded (Mann– Whitney U test) and the contour interval is 0.05. the diagonal for high probabilities and above the di- then the system would have no sharpness (sharpness is agonal for low probabilities; Figs. 4c and 4d), but this is the tendency to forecast extreme values). The histo- particularly apparent for P2-S and P1.5. Overconfidence grams accompanying the reliability diagrams for the is a common situation for seasonal forecasting (Mason forecast in the second fortnight (Fig. 4d) indicate that for and Stephenson 2008). It is evident from Fig. 4 that the P2-M, the forecast probabilities peak in frequency at the older P1.5 system is less overconfident and more reliable climatological probability. This is in contrast to the re- than the newer P2-S system, particularly during fort- sults from the first fortnight (Fig. 4c), where the fre- night 1. This is essentially due to the fact that P1.5 per- quency of forecasts peaks in the lowest probability bins. turbs the atmosphere initial conditions, whereas P2-S This indicates that as the forecast lead time progresses, does not. The consequence of the latter is a lack of en- the forecasts become less sharp (i.e., there is a tendency semble spread in P2-S (compared to P1.5) during the toward forecasts of climatology). Note that owing to the first 3 weeks of the forecast such that the spread does not lack of ensemble spread in P2-S, forecast probabilities encompass the error; the model is underdispersive (e.g., are still sharp during the second fortnight (Fig. 4d). This see Fig. 3). This small spread also has the effect of dra- is problematic, since the associated forecasts are not matically reducing the effective ensemble size and a re- reliable. duced ensemble size is known to have a negative impact Spatial and seasonal variations in forecast skill are on forecast reliability (Doblas-Reyes et al. 2008). shown in Fig. 5 in the form of ROC areas for predicting Deviations from perfect reliability in the diagram can rainfall in the upper tercile. A ROC area (area under the be due to sampling limitations rather than necessarily ROC curve) below 0.5 implies a skill level lower than true deviations from reliability (e.g., Toth et al. 2003). climatology [statistically significant ROC areas are shaded As such, a reliability diagram is accompanied by a his- in Fig. 5, determined using the Mann–Whitney U statistic; togram that indicates the sample size in each probability Mason and Graham (2002); Wilks (2006)]. ROC areas in bin. The histogram can be used to indicate the confi- the first fortnight are high across most of the country dence associated with the result for each probability bin, (not shown), but there is a pronounced degradation in as well as the sharpness of the forecast system. If all the skill in the second fortnight. Most of the skill in the first forecasts fell in the model climatology probability bin, fortnight comes from the first week of the forecast (not

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC 4438 MONTHLY WEATHER REVIEW VOLUME 141

FIG. 6. As in Fig. 4, but for maximum temperature above the upper tercile. shown). Even in the first fortnight when all three systems Figs. 6–9. Forecasts of Tmax in the first fortnight from have their highest skill, P2-M is a clear improvement P2-S and P1.5 are significantly less reliable and con- over P1.5 and especially over P2-S (not shown). In the tribute less to the overall skill than P2-M (for the most second fortnight (Fig. 5), P2-M is again generally more part the P2-S and P1.5 points fall outside the skillful skillful than P1.5 and P2-S, although the areas of skill shaded area; Fig. 6c). In the second fortnight, Tmax now contract and are generally highest in the east in the forecasts from P2-S and P1.5 are unreliable, but fore- austral winter and spring. This is where the impacts of casts from P2-M remain somewhat reliable (Fig. 6d). In ENSO and the IOD [i.e., SST anomaly differences be- particular, the reliability curve for P2-S in the second tween the western and eastern tropical Indian Ocean; fortnight represents a system that has very little reso- Saji et al. (1999)] on rainfall are strongest and express lution: the conditional outcome is virtually the same themselves even as early as the second fortnight of the regardless of the forecasted probability; that is, the forecast (Hudson et al. 2011a). forecasts are unable to resolve times when upper tercile As for rainfall, ROC curves, attribute curves, and Tmax is more or less likely than the overall climato- spatial ROC area plots for Tmax (above the upper ter- logical probability (Fig. 6d). The same can be said of all cile) and Tmin (below the lower tercile) are shown in the systems for Tmin in the second fortnight; they are

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC DECEMBER 2013 HUDSON ET AL. 4439

FIG. 7. As in Fig. 5, but for maximum temperature above the upper tercile.

all unreliable, exhibit very little resolution, and do not found that the relationship of the SAM with daily Tmin contribute positively to the overall skill (Fig. 8d). In over Australia was weaker than the relationship with contrast, for the first fortnight the forecasts from P2-M daily Tmax. do contribute positively to overall skill and are more Interestingly, the greater reliability in the P2-M system reliable than P2-S and P1.5 (Fig. 8c). ROC areas for compared to P2-S and P1.5 is still evident on seasonal Tmax and Tmin exhibit similar dropoffs in skill from time scales. Figure 10 shows the attributes diagrams from the first to second fortnights, as did rainfall. However, each system for rainfall, Tmax, and Tmin over Australia Tmax forecasts are generally more skillful (higher for the first season (zero lead) of the forecast. P2-M has ROC scores) than those for rainfall, especially for the increased reliability, resolution, and skill compared to first fortnight (Figs. 4 and 6). As for rainfall, P2-M is P1.5 and P2-S for rainfall and Tmax, such that the curve again generally more skillful than P1.5 and P2-S for clearly lies within the skillful region for P2-M. For Tmin Tmax over Australia (Figs. 6 and 7) and in the second the forecasts from all three systems are unreliable, show fortnight this skill is highest over eastern Australia little resolution, and do not contribute positively to skill. during spring (September–November, SON; see Fig. 7). Beyond the first season, the differences in reliability For forecasts of the second fortnight, not only is Tmin between P2-M and P2-S are small (if any), suggesting the least reliable of the three variables (Fig. 8d) but it that the improved reliability and skill in rainfall and also tends to have the lowest ROC areas (Fig. 9). The Tmax are coming primarily from the first month of the reasons why the model has less skill for Tmin compared forecast. Also of interest, is that in contrast to intra- to Tmax are unclear. As suggested in Hudson et al. seasonal time scales, beyond the zero-lead first season, (2011a), it may be related to model error, but it is pos- P2-S becomes much more reliable for Tmax (not shown) sible that it is a reflection of the reduced predictability and rainfall (Wang et al. 2011) than is shown in Figs. 4, for Tmin. In an examination of the relationship between 6, and 10, and is noticeably more reliable than P1.5 (in ENSO and Australian land surface temperature, Jones contrast to Figs. 4, 6, and 10). The greater reliability of and Trewin (2000) found that statistically significant both P2-S and P2-M at seasonal time scales compared correlations between the Southern Oscillation index to P1.5 is due more to the use of a multimodel combi- (SOI) and seasonal mean Tmin were less widespread nation rather than a larger ensemble size (Wang et al. than those for Tmax. Similarly, Hendon et al. (2007) 2011).

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC 4440 MONTHLY WEATHER REVIEW VOLUME 141

FIG. 8. As in Fig. 4, but for minimum temperature below the lower tercile. c. The MJO and SAM the impact of the different ensemble generation strat- egies on the prediction of the indices [note: the hindcast We also demonstrate a positive impact of the new years used in Marshall et al. (2011b) differ from the ensemble generation strategy for the prediction of some present study, the former using 1989–2006]. of the key components of the large-scale atmospheric The MJO is a prominent mode of tropical intraseasonal circulation that support intraseasonal predictability. variability consisting of large-scale coupled patterns in Previous work has investigated the ability of P1.5 to atmospheric circulation and deep convection that prop- predict the MJO (Rashid et al. 2010; Marshall et al. 2011a) agate eastward over the equatorial Indian and western and the SAM (Marshall et al. 2011c), both important Pacific Oceans. The state of the MJO is described using drivers of intraseasonal climate variability over Australia the bivariate Real-time Multivariate MJO (RMM) in- (Hendon et al. 2007; Risbey et al. 2009; Wheeler et al. dex (Wheeler and Hendon 2004). The RMM index is 2009). Marshall et al. (2011b) provided an update to this calculated from a combined empirical orthogonal func- work by comparing the skill in predicting these indices by tion (EOF) analysis of 158S–158N-averaged outgoing P2-M with that from P1.5. Here, we add to this work by longwave radiation (OLR) and 850- and 200-hPa zonal including P2-S in the comparison in order to investigate wind anomalies. The predicted RMM index for POAMA

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC DECEMBER 2013 HUDSON ET AL. 4441

FIG. 9. As in Fig. 5, but for minimum temperature below the lower tercile. is computed by projecting each ensemble member’s of the MJO compared to P1.5 and P2-S. P2-S has the predicted equatorially averaged OLR and zonal wind largest error (Fig. 11a) and lowest correlation (Fig. 11b) anomalies onto the observed EOF pair [calculated from of the three systems over the 35 days shown. In terms of the NCEP–National Center for Atmospheric Research the bivariate RMSE, forecast skill improves by about (NCAR) reanalysis; Kalnay et al. (1996)]. The forecasts 1 week lead time for P2-M compared to P1.5 for lead are then verified using a bivariate correlation and bi- times beyond 2 weeks (e.g., the error at 15 days in P1.5 is variate RMSE. For more details of this verification equivalent to the error at 21 days in P2-M; Fig. 11a). The technique, see Rashid et al. (2010). Figure 11 shows the bivariate RMSE of P2-M reaches the RMSE obtained forecast skill for the bivariate RMSE and correlation for a climatological forecast after 33 days (a climato- using the ensemble mean. P2-M has improved forecasts logical forecast of the RMM index has an RMSE equal

FIG. 10. Attributes diagrams of the probability that (a) rainfall, (b) maximum temperature, and (c) minimum temperature in the first season of the forecast is in the upper tercile (lower tercile for minimum temperature) for P1.5 (green), P2-S (blue), and P2-M (red) based on all start months and model grid boxes over Australia. Plotting convention is as in Fig. 4.

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC 4442 MONTHLY WEATHER REVIEW VOLUME 141

FIG. 11. (a) RMSE of the ensemble mean (solid curves) and ensemble spread (dashed curves) for the bivariate RMM index, and the (b) bivariate correlation for the RMM index using the ensemble mean. Red curves are for P2-M, blue curves for P2-S, and green curves for P1.5. Scores are shown as a function of lead time (days) and all forecast start months are used. to O2). Figure 11a also shows that the bivariate ensemble anomalies (NCEP–NCAR reanalysis; Kalnay et al. 1996) spread is always smaller than the error in all three sys- onto the leading EOF of observed zonally averaged tems, indicative of underdispersed systems. However, monthly mean MSLPs between 258 and 758S. The P2-M is a marked improvement over the other two POAMA SAM index is calculated by projecting the daily systems, suggesting that the new ensemble generation mean predicted MSLP anomalies from each POAMA technique had a positive impact on increasing the en- ensemble member onto the observed EOF [see Marshall semble spread, but not at the expense of increasing the et al. (2011c) for details]. Figure 12 shows the ensemble RMSE. As for the 500-hPa heights, the RMSE of P2-S mean RMSE and correlation, and the ensemble spread overshoots the expected climatological threshold due to for the three forecast systems. For a climatological fore- a lack of ensemble spread (Fig. 11a). cast of the SAM index, the RMSE 5 1. P2-M reaches this If we consider a correlation of 0.5 as the threshold for threshold after about 17 days, whereas P1.5 reaches it defining skillful forecasts of the MJO (e.g., Kang and after 14 days (Fig. 12a). As seen for the MJO index, P2-S Kim 2010; Rashid et al. 2010; Arribas et al. 2011; Vitart overshoots this climatological threshold due to the lack of and Molteni 2011), then the correlation of the bivariate ensemble spread (Fig. 12a). The P2-M system has signif- RMM index drops to 0.5 at days 17, 20, and 22.5 in P2-S, icantly more spread than the P2-S system and more spread P1.5, and P2-M, respectively (Fig. 11b). This result for than P1.5 for the first 15 days of the forecast. Interestingly, P2-M is comparable to the correlation skill obtained for the spread in P2-M actually exceeds the error (over- the ECMWF system (version Cy32r3), where correla- dispersive) for the first 5 days of the forecast. Correlation tion skill greater than 0.5 is obtained out to 23 days skills above 0.5 are achieved out to 12 days for P2-M and (Vitart and Molteni 2011). The correlations from P2-M out to 10 days in P1.5 (Fig. 12b). The new P2-M model are significantly different from P2-S (but not from P1.5) therefore has a slightly improved ability to predict the over days 11–30 of the forecast [5% significance level, SAM index. The correlations from P2-M are significantly two tailed, Fisher’s Z transformation; n 5 648 (i.e., 12 different from P2-S (but not from P1.5) over days 14–18 of start months, 27 yr, and two independent components the forecast [5% significance level, two tailed, Fisher’s Z comprising the MJO index)]. transformation; n 5 324 (i.e., 12 start months and 27 yr)]. The SAM is an important mode of variability for high and middle latitudes and is characterized by shifts in the 4. Contribution to improvement: Ensemble strength of the zonal flow between about 558–608Sand generation strategy, multimodel formulation, or 358–408S (e.g., Gong and Wang 1999; Thompson and ensemble size? Wallace 2000). To assess the skill in predicting the SAM, a daily SAM index is calculated for the observations by The section above has shown that the P2-M system is projecting observed daily mean sea level pressure (MSLP) more skillful and reliable than both the P1.5 and P2-S

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC DECEMBER 2013 HUDSON ET AL. 4443

FIG. 12. As in Fig. 11, but for the SAM index. systems for intraseasonal and even the first season y axis at the climatologically observed frequency. If a forecasts. The major difference between P2-M and P2-S system had no resolution, the reliability curve would lie is the new ensemble generation strategy, which provides on this horizontal ‘‘no resolution’’ line (Wilks 2006). In perturbations to both the atmosphere and ocean at the contrast, the REL component measures the conditional initial time (P2-M) compared to only perturbations bias of the probabilistic forecasts (i.e., the correspon- to the ocean (P2-S). The other minor difference is that dence between the forecast probability and the relative P2-M has 33 ensemble members and P2-S has 30 mem- observed frequency). In terms of the attributes diagram, bers (with 11 and 10 unique initial conditions, respec- REL measures the deviation of the curve from the di- tively). It seems clear, therefore, that the improvements agonal 1:1 line, such that perfect reliability will give in P2-M over P2-S can be ascribed to the new ensemble REL 5 0. generation technique. However, the reason for the im- The conclusions from the ROC area and RES results provement of P2-M over P1.5 is not yet clear: it could in Fig. 13 are very similar to each other [Toth et al. be due to the coupled-breeding ensemble generation (2003) note that the resolution component of the Brier technique of P2-M compared to the lagged-average score and the ROC area often provide very similar qual- ensemble method of P1.5; it could be due to the larger itative information of the forecasting system]. In fortnight ensemble size of P2-M (33 members) compared to P1.5 1 for all seasons, the P2-M system has the highest dis- (10 members) or it could be due to the benefits of the crimination (ROC area) and resolution (RES) com- multimodel configuration of P2-M compared to the pared to the other two systems, whereas the differences single model of P1.5. between the systems are smaller in fortnight 2 (consis- Figure 13 shows the ROC area, as well as the resolu- tent with Fig. 4). The multimodel configuration, which tion (RES) and reliability (REL) components of the also implies a larger ensemble size, has little or no im- Brier score, for rainfall over Australia above the upper pact on the forecast resolution and discrimination, shown tercile. The Brier score can be decomposed into three by negligible differences between the constituent models components, two of which measure forecast resolution and the relevant multimodel for both P2-M and P2-S, and reliability (Murphy 1973). The third term is a func- respectively (Figs. 13a, 13b, 13d, and 13e). We note that tion of the climatological frequency of the occurrence of ensemble size alone does not normally have a large im- the event and is not related to forecast quality. A fore- pact on forecast resolution (Doblas-Reyes et al. 2008). cast system has good resolution and a high value for the The biggest difference between the three systems and RES component if the forecast probabilities can discern within each system is related to forecast reliability. It is subsample forecast periods with different relative fre- clear that P2-M is the most reliable system for fortnights quencies of the event [i.e., frequencies that differ from 1 and 2 in all seasons, but there are other important in- the climatological base rate; Wilks (2006)]. In terms of ferences that can be made from Figs. 13c and 13f. First, the attributes diagram, RES measures the deviation the benefit of the multimodel and larger ensemble size of the curve from the horizontal line that intersects the (compared to the constituent models) is quite small for

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC 4444 MONTHLY WEATHER REVIEW VOLUME 141

FIG. 13. (left) The ROC area, (middle) the RES, and (right) the REL for 2-week mean rainfall in the upper tercile for the (a)–(c) first fortnight (days 1–14) and (d)–(f) second fortnight (days 15–28) of the forecast. Scores are calculated using all Australian land points. Scores are shown for component models as well as multimodel means (see key). both P2-M and P2-S in the first fortnight. This is also than from an increased ensemble size. For instance, evident in the previously shown plot of RMSE and en- ENSO behaves differently in each of the component semble spread (Fig. 3b; P2-S and 3c P2-M), where the models of POAMA-2 and this will affect the simulated gray lines, giving the result from one of the constituent teleconnections to regional rainfall and temperature 10-member ensemble models, can be contrasted with (Wang et al. 2011). the colored lines showing the larger MME ensemble. The reason for the modest contribution of the larger In both systems, there is not much difference between ensembles of P2-S and P2-M to the forecast spread and the RMSE and spread of the larger MME compared to reliability on intraseasonal time scales is because their the smaller single-model ensemble on the intraseasonal component models are initialized using the same set of time scale. Figure 13 does, however, suggest that in fort- initial conditions; that is, for P2-M (P2-S) there are only night 2 the larger MME has a more significant positive 11 (10) unique initial conditions available for initializa- contribution to forecast reliability in relation to the tion. The impact of this is illustrated in an example of an constituent models compared to fortnight 1. As men- MJO forecast using P2-M (Fig. 14). In general, there is tioned in the previous section, on the seasonal time scale less separation of the forecast plumes when grouped the benefit of the larger MME on forecast reliability is by ensemble member (Fig. 14a) than when grouped by clear (Wang et al. 2011). Seasonal forecasts (3-month constituent model (Fig. 14b) (cf. forecasts with the same average with a 1-month lead time) of Australian rain- color in Fig. 14). For the former, there are distinct groups fall above the upper tercile from the large multimodel of three forecast plumes that indicate ensemble members are reliable, whereas those from the constituent models from the three constituent models have been initialized are not (Wang et al. 2011). Wang et al. (2011) found with the same initial conditions (Fig. 14a). This clearly that the major contribution to the improved reliability signifies the need for using different initial conditions of the P2-S multimodel compared to the component for each of the three models, to better simulate the en- models is from the use of three model versions rather semble spread at these time scales. For the next version

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC DECEMBER 2013 HUDSON ET AL. 4445

FIG. 14. Example of an MJO forecast initialized on 17 Nov 2011 from the P2-M system. The 33 forecast plumes have been grouped (same color) by (a) ensemble member (i.e., those forecasts using the same initial conditions have the same color) and (b) model. of POAMA we plan to use a unique initial condition for reliability over P1.5 due to the different ensemble gen- each member. eration technique, namely a burst ensemble with per- The component models P2c-M and P2c-S have exactly turbed atmosphere and ocean initial conditions compared the same model formulation as P1.5 and their ensembles to a time-lagged ensemble (with no ocean perturbations). are of comparable size. Comparing these three models The improved ensemble generation strategy for P2-M therefore highlights differences due to ocean initial provides a clear benefit in terms of greatly increased conditions and the ensemble generation strategies. For reliability compared to P2-S and P1.5. both fortnights and all seasons, P1.5 is more reliable This conclusion is also supported by the Brier skill than P2c-S, but less reliable than P2c-M (Figs. 13c and score calculation (Fig. 15), which shows the percentage 13f). P2c-M and P2c-S both use ocean initial conditions improvement in skill compared to a climatological fore- from the PEODAS assimilation, whereas P1.5 uses cast for forecasting above the upper tercile for Tmax in ocean initial conditions from the optimum interpolation the first and second fortnights for P1.5, P2c-S, and P2c-M assimilation system. The differences in reliability and (i.e., all three have exactly the same model configuration skill (the latter is discussed in the next paragraph) be- and ensemble size). For both fortnights there is a clear tween the three models do not, therefore, appear to be improvement in skill in P2c-M compared to P1.5 and related to the quality of the ocean initial conditions P2c-S (the same is true for Tmin and rainfall; not shown). [which are superior using PEODAS; Yin et al. (2011); P2c-M shows 20%–35% increases in skill compared to Zhao et al. (2013b)]. This does not imply that the ocean a climatological forecast in the first fortnight and initial conditions have no impact or are not important. In 10%–15% improvements in the second fortnight, with fact, P2-S shows better skill at predicting seasonal sea improvements focused over western and southeastern surface temperature variations associated with ENSO Australia (Fig. 15). compared to P1.5, directly as a result of the improved ocean initial conditions (Zhao et al. 2013a, manuscript 5. Conclusions submitted to Climate Dyn.). However, on intraseasonal time scales, the main source of the difference is related The production of skillful and reliable probabilistic to the ensemble generation strategy. P2c-S is the least forecasts is a desirable attribute of any forecasting sys- reliable because, as discussed previously, perturbing tem. The new POAMA-2 multiweek forecast system only the ocean initial conditions and not the atmosphere (P2-M) demonstrates greatly improved reliability of does not produce enough ensemble atmospheric spread forecasts of Australian climate for lead times out to one on intraseasonal time scales. P2c-M has improved season, in comparison to the seasonal configuration of

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC 4446 MONTHLY WEATHER REVIEW VOLUME 141

FIG. 15. Debiased BSS (reference score is climatology) for 2-week mean maximum temperature above the upper tercile for (left) fortnight 1 (days 1–14) and (right) fortnight 2 (days 15–18) using all forecast start months for (a),(b) P1.5, (c),(d) P2c-S, and (e),(f) P2c-M. The score reflects the percentage (i.e., multiply the score by 100) improvement in skill compared to a climatological forecast. Grid boxes showing skill better than a climatological forecast by more than 10% are shaded. Areas of negative skill are shown by the dashed contours. The contour interval is 0.05 (i.e., 5%). These three single model systems differ in their ensemble generation strategy, but have the same model configurations and ensemble size (10 members).

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC DECEMBER 2013 HUDSON ET AL. 4447

POAMA-2 (P2-S) and the older POAMA-1.5 (P1.5) method has a better-tuned ensemble compared to the version. This improvement can be attributed to the new lagged ensemble. This does not preclude using a com- ensemble generation scheme used for characterizing bined approach to generate the final ensemble [i.e., uncertainty in initial conditions. This scheme, applied to combining successive times of burst (bred) ensembles to the forecasts in burst ensemble mode (all ensemble create a larger ensemble]. The substantial improvement members initialized at the same time), generates per- in spread/dispersion from using perturbations produced turbed atmosphere and ocean initial conditions using by coupled breeding does not, however, indicate that a coupled-breeding approach. In contrast, P2-S uses these perturbations are optimal in the sense that the full a burst ensemble created from perturbed ocean initial range of possible outcomes is adequately sampled given conditions, but no perturbations are applied to the at- the limited ensemble size. Ongoing experimentation is mosphere; and P1.5 uses time-lagged atmospheric initial aimed at optimizing the method of generating pertur- conditions to generate the ensemble, but the ocean ini- bations using coupled breeding, but the gains made to tial condition is the same in all members. The coupled- date already justify the approach. We also note that our breeding scheme results in increased ensemble spread in comparison between the breeding method (i.e., P2c-M) the first month of the forecasts compared to either the and the lagged method (P1.5) is not a clean comparison, P2-S approach or the lagged-ensemble approach of P1.5, since the ocean assimilation systems are different in the and this reduces the underdispersive nature of the two systems (note that P2c-M and P1.5 are identical forecasts. Although the major impact of the new en- models with comparable ensemble sizes). We are cur- semble generation scheme is to improve the reliability of rently running additional experiments that isolate the forecasts, the P2-M system also exhibits improved impact of the ensemble generation method, including forecast discrimination and skill in fortnights 1 and 2 answering other questions such as the importance of compared to P2-S and P1.5, as shown by the ROC area having flow-dependent perturbations. and Brier skill score maps. It also demonstrates in- Unlike for seasonal time scales, on intraseasonal time creased skill in predicting important drivers of intra- scales there appears to be little benefit from the POAMA-2 seasonal climate, namely the MJO and the SAM. The multimodel approach. This is because each component benefits of the coupled ensemble generation approach model version uses the same set of perturbed initial are not restricted to the intraseasonal time scale. Fore- conditions. As such, the benefit of using different model casts of zero-lead seasonal mean rainfall and tempera- formulations is only evident at much longer lead times ture over Australia are more reliable and skillful in P2-M for which ensemble spread develops as a result of model compared to P2-S. For this reason, the BoM’s operational differences (Wang et al. 2011). Future development of seasonal forecasts from POAMA are now based on the POAMA will include using different perturbed initial P2-M system configuration, thus having one system pro- conditions for each model version, to better simulate the ducing a seamless set of products from weeks to seasons. ensemble spread at these time scales. The results of our study demonstrate the possible Like most intraseasonal–seasonal forecast systems, shortcomings of using lagged atmospheric initial condi- P2-M still has separate data assimilation schemes for tions as a method of generating perturbed initial con- the ocean, land, and atmosphere components of the ditions. Lagged-atmospheric initial conditions, which is global coupled model. Separate assimilation can lead a common and simple approach to generating pertur- to initialization shock in the forecasts, as the initial bations (e.g., Jones et al. 2011; Pegion and Sardeshmukh coupled model state is not in dynamical balance. In 2011; Liu et al. 2012), can result in underdispersive contrast, a fully coupled data assimilation scheme would forecasts (assuming no additional method of perturbing likely reduce the shock and improve forecasts. The en- the forecasts), even though the magnitudes of the initial semble generation and initialization strategy devel- perturbations are comparable or larger than the analysis oped for P2-M provides a solution somewhat closer to uncertainty. Presumably, the increased spread and that which would be obtained with fully coupled as- faster growth in spread from the breeding method are similation and is closely aligned with the long-term because the method is more effective at emphasizing the goal of fully coupled data assimilation with POAMA. fast-growing errors. The breeding method is designed to Since the breeding scheme is performed with the coupled perturb those modes that grow the fastest during the model, the perturbations that are used to initialize the analysis cycle leading up to the initial time and are thus forecasts are in dynamical balance. This new strategy the modes most likely to dominate the analysis errors represents a significant milestone in the development (e.g., see also Toth and Kalnay 1993). Our study has of the POAMA forecast system. How best to initialize shown, through a comparison of the ensemble spread the coupled system for successful intraseasonal pre- with the error of the ensemble mean, that the breeding diction is still largely an unanswered question. This

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC 4448 MONTHLY WEATHER REVIEW VOLUME 141 work contributes to advancing our knowledge on this Liu, X., S. Yang, A. Kumar, S. Weaver, and X. Jiang, 2012: Di- issue. agnostics of subseasonal prediction biases of the Asian sum- mer monsoon by the NCEP Climate Forecast System. Climate Acknowledgments. This work was supported by the Dyn., doi:10.1007/s00382-012-1553-3, in press. Manabe, S., and J. Holloway, 1975: The seasonal variation of the Managing Climate Variability Program of the Grains hydrological cycle as simulated by a global model of the at- Research and Development Corporation. Mei Zhao, mosphere. J. Geophys. Res., 80, 1617–1649. Wasyl Drosdowsky, and two anonymous reviewers are Marshall, A. G., D. Hudson, M. C. Wheeler, H. H. Hendon, and thanked for their useful comments in the preparation of O. Alves, 2011a: Assessing the simulation and prediction of this manuscript. rainfall associated with the MJO in the POAMA seasonal forecast system. Climate Dyn., 37, 2129–2141. ——, ——, ——, ——, and ——, 2011b: Evaluating key drivers of REFERENCES Australian intraseasonal climate variability in POAMA-2: A Arribas, A., and Coauthors, 2011: The GloSea4 ensemble pre- progress report. CAWCR Research Letters, No. 7, 10–16. diction system for seasonal forecasting. Mon. Wea. Rev., 139, ——, ——, ——, ——, and ——, 2011c: Simulation and prediction 1891–1910. of the southern annular mode and its influence on Australian Brunet, G., and Coauthors, 2010: Collaboration of the weather and intra-seasonal climate in POAMA. Climate Dyn., 38, 2483–2502. climate communities to advance subseasonal to seasonal Mason, S. J., and N. E. Graham, 1999: Conditional probabilities, prediction. Bull. Amer. Meteor. Soc., 91, 1397–1406. relative operating characteristics, and relative operating levels. Cai, M., E. Kalnay, and Z. Toth, 2003: Bred vectors of the Zebiak– Wea. Forecasting, 14, 713–725. Cane model and their application to ENSO predictions. ——, and ——, 2002: Areas beneath the relative operating char- J. Climate, 16, 40–55. acteristics (ROC) and relative operating levels (ROL) curves: Colman, R., and Coauthors, 2005: BMRC Atmospheric Model Statistical significance and interpretation. Quart. J. Roy. Me- (BAM) version 3.0: Comparison with mean climatology. BMRC teor. Soc., 128, 2145–2166. Research Rep. 108, Bureau of Meteorology, 66 pp. ——, and D. Stephenson, 2008: How do we know whether seasonal Doblas-Reyes, F. J., C. A. S. Coelho, and D. B. Stephenson, 2008: climate forecasts are any good? Seasonal Climate: Forecasting How much does simplification of probability forecasts reduce and Managing Risk, A. Troccoli et al., Eds., NATO Science forecast quality? Meteor. Appl., 15, 155–162. Series, Springer Academic, 259–289. Gong, D., and S. Wang, 1999: Definition of Antarctic Oscillation Mills, G. A., G. Weymouth, D. A. Jones, E. E. Ebert, and M. J. index. Geophys. Res. Lett., 26, 459–462. Manton, 1997: A national objective daily rainfall analysis Gottschalck J, and Coauthors, 2008: Madden-Julian oscillation system. BMRC Techniques Development Rep. 1, Bureau of forecasting at operational modelling centres. CLIVAR Ex- Meteorology, 30 pp. € changes, Vol. 13, No. 4, International CLIVAR Project Office, Muller, W. A., C. Appenzeller, F. J. Doblas-Reyes, and M. A. Southampton, United Kingdom, 18–19. Liniger, 2005: A debiased ranked probability skill score to Hendon, H. H., D. W. J. Thompson, and M. C. Wheeler, 2007: evaluate probabilistic ensemble forecasts with small ensemble Australian rainfall and surface temperature variations asso- sizes. J. Climate, 18, 1513–1523. ciated with the Southern Hemisphere annular mode. J. Cli- Murphy, A. H., 1973: A new vector partition of the probability mate, 20, 2452–2467. score. J. Appl. Meteor., 12, 595–600. Hudson, D., O. Alves, H. H. Hendon, and A. G. Marshall, 2011a: Oke, P. R., A. Schiller, D. A. Griffin, and G. B. Brassington, 2005: Bridging the gap between weather and seasonal forecasting: Ensemble data assimilation for an eddy-resolving ocean model Intraseasonal forecasting for Australia. Quart. J. Roy. Meteor. of the Australian region. Quart. J. Roy. Meteor. Soc., 131, 3301– Soc., 137, 673–689. 3311. ——, ——, ——, and G. Wang, 2011b: The impact of atmospheric Palmer, T., R. Buizza, R. Hagedorn, A. Lawrence, M. Leutbecher, initialisation on seasonal prediction of tropical Pacific SST. and L. Smith, 2006: Ensemble prediction: A pedagogical Climate Dyn., 36, 1155–1171. perspective. ECMWF Newsletter, No. 106, ECMWF, Reading, ——, A. G. Marshall, and O. Alves, 2011c: Intraseasonal fore- United Kingdom, 11–17. casting of the 2009 summer and winter Australian heat waves Pegion, K., and P. D. Sardeshmukh, 2011: Prospects for improving using POAMA. Wea. Forecasting, 26, 257–279. subseasonal predictions. Mon. Wea. Rev., 139, 3648–3666. Jones, C., J. Gottschalck, L. M. V. Carvalho, and W. Higgins, 2011: Rashid, H. A., H. H. Hendon, M. C. Wheeler, and O. Alves, 2010: Influence of the Madden–Julian oscillation on forecasts of Predictability of the Madden–Julian oscillation in the POAMA extreme precipitation in the contiguous United States. Mon. dynamical seasonal prediction system. Climate Dyn., 36, 649– Wea. Rev., 139, 332–350. 661. Jones, D. A., and B. C. Trewin, 2000: On the relationships between Risbey, J. S., M. J. Pook, P. C. McIntosh, M. C. Wheeler, and H. H. the El Nino–Southern~ Oscillation and Australian land surface Hendon, 2009: On the remote drivers of rainfall variability in temperature. Int. J. Climatol., 20, 697–719. Australia. Mon. Wea. Rev., 137, 3233–3253. Kalnay, E., and Coauthors, 1996: The NCEP/NCAR 40-Year Re- Saji, N. H., B. N. Goswami, P. N. Vinayachandran, and T. Yamagata, analysis Project. Bull. Amer. Meteor. Soc., 77, 437–471. 1999: A dipole mode in the tropical Indian Ocean. Nature, 401, Kanamitsu, M., W. Ebisuzaki, J. Woollen, S.-K. Yang, J. J. Hnilo, 360–363. M. Fiorino, and G. L. Potter, 2002: NCEP–DOE AMIP-II Schiller, A., J. S. Godfrey, P. McIntosh, and G. Meyers, 1997: A Reanalysis (R-2). Bull. Amer. Meteor. Soc., 83, 1631–1643. global ocean general circulation model climate variability Kang, I.-S., and H.-M. Kim, 2010: Assessment of MJO pre- studies. CSIRO Marine Research Rep. 227, 60 pp. dictability for boreal winter with various statistical and dy- ——, ——, P. C. McIntosh, G. Meyers, N. R. Smith, O. Alves, namical models. J. Climate, 23, 2368–2378. G. Wang, and R. Fiedler, 2002: A new version of the Australian

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC DECEMBER 2013 HUDSON ET AL. 4449

Community Ocean Model for seasonal climate prediction. ——, D. Hudson, Y. Yin, O. Alves, H. Hendon, S. Langford, CSIRO Marine Research Rep. 240, 79 pp. G. Liu, and F. Tseitkin, 2011: POAMA-2 SST skill assessment Smith, N. R., J. E. Blomley, and G. Meyers, 1991: A univariate and beyond. CAWCR Research Letters, No. 6, 40–46. statistical interpolation scheme for subsurface thermal analy- Wang, X., C. Bishop, and S. Julier, 2004: Which is better, an en- ses in the tropical oceans. Prog. Oceanogr., 28, 219–256. semble of positive–negative pairs or a centered spherical Stockdale, T. N., 1997: Coupled ocean–atmosphere forecasts in the simplex ensemble? Mon. Wea. Rev., 132, 1590–1605. presence of climate drift. Mon. Wea. Rev., 125, 809–818. Wei,M.,Z.Toth,R.Wobus,andY.Zhu,2008:Initialpertur- Thompson, D. W. J., and J. M. Wallace, 2000: Annular modes in the bations based on the ensemble transform (ET) technique in extratropical circulation. Part I: Month-to-month variability. the NCEP global ensemble forecast systems. Tellus, 60A, J. Climate, 13, 1000–1016. 62–79. Toth, Z., and E. Kalnay, 1993: at the NMC: Weigel, A. P., M. A. Liniger, and C. Appenzeller, 2007: The dis- The generation of perturbations. Bull. Amer. Meteor. Soc., 74, crete Brier and ranked probability skill scores. Mon. Wea. 2317–2330. Rev., 135, 118–124. ——, and ——, 1996: Climate ensemble forecasts: How to create Weisheimer, A., T. N. Palmer, and F. J. Doblas-Reyes, 2011: As- them? Idojaras, 100, 43–52. sessment of representations of model uncertainty in monthly ——, and ——, 1997: Ensemble forecasting at NCEP and the and seasonal forecast ensembles. Geophys. Res. Lett., 38, breeding methods. Mon. Wea. Rev., 125, 3297–3319. L16703, doi:10.1029/2011GL048123. ——, O. Talagrand, G. Candille, and Y. Zhu, 2003: Probability and Wheeler, M. C., and H. H. Hendon, 2004: An all-season real-time ensemble forecasts. Forecast Verification: A Practitioner’s Guide multivariate MJO index: Development of an index for moni- in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., toring and prediction. Mon. Wea. Rev., 132, 1917–1932. Wiley, 137–163. ——, ——, S. Cleland, H. Meinke, and A. Donald, 2009: Impacts of ——, M. Pena,~ and A. Vintzileos, 2007: Bridging the gap between the Madden–Julian oscillation on Australian rainfall and cir- weather and climate forecasting: Research priorities for intra- culation. J. Climate, 22, 1482–1498. seasonal prediction. Bull. Amer. Meteor. Soc., 88, 1427–1429. Wilks, D., 2006: Statistical Methods in Atmospheric Sciences. 2nd Uppala, S. M., and Coauthors, 2005: The ERA-40 Re-Analysis. ed. Academic Press, 627 pp. Quart. J. Roy. Meteor. Soc., 131, 2961–3012. Yang, S.-C., M. Cai, E. Kalnay, M. Rienecker, G. Yuan, and Valcke, S., L. Terray, and A. Piacentini, 2000: OASIS 2.4 Ocean Z. Toth, 2006: ENSO bred vectors in coupled ocean–atmosphere Atmospheric Sea Ice Soil users guide, version 2.4. CERFACS general circulation models. J. Climate, 19, 1422–1436. TR/CMGC/00-10, 85 pp. Yin, Y., O. Alves, and P. R. Oke, 2011: An ensemble ocean data Vitart, F., and F. Molteni, 2011: Simulation of the Madden–Julian assimilation system for seasonal prediction. Mon. Wea. Rev., oscillation and its impact over Europe in ECMWF’s monthly 139, 786–808. forecasts. ECMWF Newsletter, No. 126, ECMWF, Reading, Zhao, M., H. H. Hendon, O. Alves, Y. Yin, and D. Anderson, United Kingdom, 12–17. 2013b: Impact of salinity constraints on the simulated mean ——, and Coauthors, 2008: The new VarEPS-monthly forecasting state and variability in a coupled seasonal forecast model. system: A first step towards seamless prediction. Quart. J. Roy. Mon. Wea. Rev., 141, 388–402. Meteor. Soc., 134, 1789–1799. Zhong, A., O. Alves, H. H. Hendon, and L. Rikus, 2006: On aspects Wang, G., O. Alves, and N. Smith, 2005: BAM3.0 tropical surface of the mean climatology and tropical interannual variability in flux simulation and its impact on SST drift in a coupled model. the BMRC Atmospheric Model (BAM 3.0). BMRC Research BMRC Research Rep. 107, Bureau of Meteorology, 39 pp. Rep. 121, Bureau of Meteorology, 43 pp.

Unauthenticated | Downloaded 10/05/21 05:21 PM UTC