Field Significance of Performance Measures in the Context of Regional Climate Model Evaluation
Total Page:16
File Type:pdf, Size:1020Kb
Theor Appl Climatol DOI 10.1007/s00704-017-2100-2 ORIGINAL PAPER Field significance of performance measures in the context of regional climate model evaluation. Part 1: temperature Martin Ivanov1 · Kirsten Warrach-Sagi2 · Volker Wulfmeyer2 Received: 4 July 2016 / Accepted: 13 March 2017 © The Author(s) 2017. This article is published with open access at Springerlink.com Abstract A new approach for rigorous spatial analysis of Urban areas in concave topography forms have a warm sum- the downscaling performance of regional climate model mer bias due to the strong heat islands, not reflected in the (RCM) simulations is introduced. It is based on a multiple observations. WRF-NOAH generates appropriate fine-scale comparison of the local tests at the grid cells and is also features in the monthly temperature field over regions of known as “field” or “global” significance. New performance complex topography, but over spatially homogeneous areas measures for estimating the added value of downscaled data even small biases can lead to significant deteriorations rel- relative to the large-scale forcing fields are developed. The ative to the driving reanalysis. As the added value of global methodology is exemplarily applied to a standard EURO- climate model (GCM)-driven simulations cannot be smaller CORDEX hindcast simulation with the Weather Research than this perfect-boundary estimate, this work demonstrates and Forecasting (WRF) model coupled with the land surface in a rigorous manner the clear additional value of dynamical ◦ modelNOAHat0.11 grid resolution. Monthly tempera- downscaling over global climate simulations. The evalu- ture climatology for the 1990–2009 period is analysed for ation methodology has a broad spectrum of applicability Germany for winter and summer in comparison with high- as it is distribution-free, robust to spatial dependence, and resolution gridded observations from the German Weather accounts for time series structure. Service. The field significance test controls the proportion of falsely rejected local tests in a meaningful way and is robust to spatial dependence. Hence, the spatial patterns of the 1 Introduction statistically significant local tests are also meaningful. We interpret them from a process-oriented perspective. In win- Climate change will induce changes not only in temperature ter and in most regions in summer, the downscaled distri- statistics but also in the spatial and temporal precipitation butions are statistically indistinguishable from the observed patterns. The expected significant societal and environmen- ones. A systematic cold summer bias occurs in deep river tal impacts call for further assessment and improvement of valleys due to overestimated elevations, in coastal areas climate projections (Trenberth et al. 2003;Schar¨ et al. 2004; due probably to enhanced sea breeze circulation, and over O’Gorman and Schneider 2009; Hartmann et al. 2013). large lakes due to the interpolation of water temperatures. Global climate models (GCMs) are the primary source of climate change information. However, the statistics they provide are not reliable on the fine scales required for impact Martin Ivanov assessment. This fundamental scale gap is the target of the [email protected] so-called downscaling or regionalisation methods. One-way nesting of regional climate models (RCMs) is the compu- 1 Department of Geography, Climatology, Climate Dynamics tationally most parsimonious but nevertheless equally well and Climate Change, Justus-Liebig University of Gießen, performing physically based downscaling method. Typical Senckenbergstraße 1, D-35390 Gießen, Germany RCM resolutions are currently 10–50 km, but simulations ∼ 2 Institute of Physics and Meteorology, University of Hohenheim, with grid resolutions down to 1 km with explicit resol- Garbenstr. 30, D-70593 Stuttgart, Germany ving of convection are becoming increasingly available M. Ivanov et al. (Prein et al. 2015). For more details on the state-of-the-art at 0.11◦ grid resolution with the WRF-NOAH model system regional climate modelling, the reader is referred to Laprise (Warrach-Sagi et al. 2013) over the territory of Germany, (2008) and Rummukainen (2010). where high-resolution gridded observation data products A good RCM performance for past climate periods are available. The current work, part 1, is devoted to 2 m is assumed to be necessary for an adequate performance monthly temperature. under different climate conditions in the future. Therefore, We employ distribution-based evaluation statistics and RCM hindcast simulations must be evaluated against obser- their relative versions to quantify the downscaling skill vation-based products. The evaluation of RCMs requires an relative to the driving reanalysis. Despite being direct objec- observational reference of the same spatial and temporal tive measures of added value, relative performance metrics resolution, which is not always available, and the problem are still underapplied. Most previous works estimate the aggravates with the increasing RCM resolution. To min- relative skill by comparing domain-aggregated scalar eval- imise biases due to misrepresentation of the large-scale uation statistics, which precludes spatial analysis, and/or forcing and focus only on RCM-related biases (e.g. due by visually inspecting the fields of the evaluation statistics to deficiencies of RCM physics or artefacts of the nest- for the downscaled and the larger-scale driving data (e.g. ing procedure itself), the evaluation must be performed in Duffy et al. 2006; Feser 2006; Sotillo et al. 2006; Buonomo a “perfect-boundary” setting, in which an RCM is nested et al. 2007; Sanchez-Gomez et al. 2009;Prommel¨ et al. 2010; in a global large-scale reanalysis (Christensen et al. 1997; Di Luca et al. 2012; Kendon et al. 2012; Cardoso et al. 2013; Pan et al. 2001). The perfect-boundary performance is bet- Chan et al. 2013; Pearson et al. 2015; Torma et al. 2015), ter than the performance the same RCM would have were it which is more or less subjective. Relative versions of few driven by a GCM simulation. In this sense, perfect-boundary distribution-based performance measures were utilised by evaluation yields an upper boundary for RCM skill. The Winterfeldt and Weisse (2009) and Vautard et al. (2013) additional information provided by the regional simulation only for individual locations, and by Winterfeldt et al. beyond the scales of the driving fields is referred to as (2011) and Dosio et al. (2015) on a gridpoint basis. Our added value (Laprise 2008;DiLucaetal.2012). Due to priority is to study in more detail the spatial structure of the data assimilation, reanalyses excel GCM runs and hence model performance as represented by the spatial patterns of are harder to outperform. Therefore, the added value of a grid-cell statistics. We propose general formulae for deriv- reanalysis-driven hindcast is a lower bound of the added ing relative versions of performance metrics. Many of the value that the same RCM can have in a GCM-driven setting relative measures are new or applied for the first time in the (Prommeletal.¨ 2010). context of RCM added value analysis. Within the EU-funded projects PRUDENCE (Christensen As performance measures are subject to sampling vari- and Christensen 2007) and its follower ENSEMBLES ability, they should be physically interpreted only after (van der Linden and Mitchell 2009), RCMs with grid the accompanying sampling uncertainty has been quanti- resolutions of 20–50 km were applied and evaluated for fied. Within the frequentist approach to statistical inference the European region (e.g. Jacob et al. 2007;Jaegeretal. (Jolliffe 2007) this can be done via confidence intervals, 2008). The ENSEMBLES project was succeeded by the which is the popular approach (e.g. Elmore et al. 2006; COordinated Regional climate Downscaling EXperiment Buonomo et al. 2007; Sanchez-Gomez et al. 2009; Kendon (CORDEX) (Giorgi et al. 2009), which focusses on a sys- et al. 2012,Chanetal.2013) or more directly via hypothe- tematical evaluation of RCM performance in an ensemble sis testing as in, e.g. Duffy et al. (2006) and Cardoso et al. of perfect-boundary conditions experiments nested within (2013) as well as in this work. Here, at each grid cell, the ERA-Interim reanalysis (e.g. Dee et al. 2011)forthe we estimate the p value for each test statistic in a Monte 1989–2008 period. Within EURO-CORDEX, the European Carlo framework. The problem of test multiplicity is tack- branch of CORDEX, an ensemble of such hindcast simu- led by determining the “field” significance (e.g. Livezey and lations has been created in order to provide comparisons Chen 1983; Ventura et al. 2004; Wilks 2006a)asimple- at horizontal grid increments of 0.44◦ (∼50 km) and 0.11◦ mented by the false discovery rate (FDR) approach of (∼12 km). Benjamini and Hochberg (1995). As FDR is generally more RCM performance is quantified by statistical measures powerful and robust to spatial correlations than alternative also known as performance metrics. Relative versions of multiple comparison methods, it provides a more meaning- these metrics allow comparison of RCM skill against that of ful spatial pattern of local rejections (Wilks 2006a). The the large-scale forcing data. The purpose of this study is to latter comprises spatial configurations of locations at which introduce a new approach for rigorous spatial analysis of the the values of the evaluation statistics are in breach with the downscaling performance and develop new relative perfor- null hypothesis, that is, which are highly unlikely to have mance metrics. This is the first of two papers that exemplarily occurred by chance. The analysis focusses on these patterns apply the methodology to a standard EURO-CORDEX run rather than on the magnitudes of the evaluation statistics as Field significance of performance measures. Part 1: temperature is conventionally the case. Most evaluation studies estimate Sturges algorithm (Sturges 1926). To achieve a better reso- statistical significance for domain-aggregated scalar mea- lution, the bin size is chosen as the minimum over all bin sures (e.g.