1.2 REPLACING MISSING DATA FOR ENSEMBLE SYSTEMS

Tyler McCandless*, Sue Ellen Haupt, George Young
The Pennsylvania State University

* Corresponding author address: Applied Research Laboratory, P.O. Box 30, The Pennsylvania State University, State College, PA 16804-0030; email: [email protected]

1. INTRODUCTION

Missing data present a problem in many fields, including meteorology. Data can be missing at random, in recurring patterns, or in large sections. Incomplete datasets can lead to misleading conclusions, as demonstrated by Kidson and Trenberth (1988), who assessed the impact of missing data on general circulation statistics by systematically decreasing the amount of available training data. They determined that the ratio of the Root Mean Square Error (RMSE) in the monthly mean to the daily standard deviation was two to three times higher when the missing data were spaced randomly rather than equally, and that the RMSE increased by up to a factor of two when the missing data occurred in one block. The spacing of the missing data can therefore affect statistical analyses, and it is useful to consider how best to replace, or impute, the missing data.

Various imputation methods have been considered. Vincent and Gullet (1999) found that highly correlated neighbor stations can be used to interpolate missing data in Canadian temperature datasets. Schneider (2001) used an expectation maximization (EM) algorithm for Gaussian data and determined that it is applicable to typical sets of climate data and that it leads to more accurate estimates of the missing values than a conventional non-iterative imputation technique. Richman et al. (2008) showed that iterative imputation techniques used to fill in systematically removed data from both linear and nonlinear synthetic datasets produced a lower Mean Square Error (MSE) than the other methods studied; three iterative imputation techniques produced similar results, and all had lower errors than using the mean value, case deletion, or simple linear regression. Earlier, Kemp et al. (1983) compared seven methods of replacing missing values and found that between-station regression yielded the smallest errors (estimated minus actual) in an analysis of temperature observing networks. They found that linear averaging within a station and a four-day moving average were not as accurate as between-station regressions, although linear averaging had smaller errors than the four-day moving average.

A major problem for statistical predictive system developers is obtaining enough input data from Numerical Weather Prediction (NWP) models for developing, or training, new post-processing methods. Such post-processing, or calibration, of NWP forecasts has been shown to decrease forecast errors since the introduction of Model Output Statistics (MOS) (Glahn and Lowry 1972). For these statistical post-processing techniques to improve forecast skill, however, they must be trained on archived forecasts and validation data, and enough data must be available for the resulting technique to be stable. Imputing any missing data thus provides larger training and validation datasets for this process.

The advent of forecast ensembles has led to new methods of post-processing. Post-processing ensemble temperature data can produce an even more accurate deterministic forecast than the raw model forecast, as well as provide information on uncertainty (Hamill et al. 2000). Some statistical calibration techniques give performance-based weights to ensemble members, assigning more weight to a member that produces relatively higher skill than the others (Woodcock and Engel 2005; Raftery et al. 2005). After performance weights are computed from a set of previous ensemble forecasts, they are used to combine the current ensemble forecasts into a single deterministic forecast. If some ensemble members are missing, however, a method must be devised to deal with the situation. Previous studies have simply excluded cases with missing ensemble forecast data (Greybush et al. 2008). In a real-time forecasting situation, however, excluding missing forecast data limits both the forecast skill of the ensemble and the uncertainty information available from it.

Currently, if an ensemble member is missing from a forecast, the National Weather Service (NWS) simply excludes it from the ensemble (Richard Grumm 2008, personal communication), which limits the spread and utility of the ensemble. As we move toward more complex weighting methods, such an approach may no longer be feasible. Unlike replacing missing observation data, replacing missing forecast data must preserve each ensemble member's characteristics in order to portray its impact on the consensus forecast accurately, so that the computed performance-based weight still applies.

The current study uses the same data and post-processing algorithms as Greybush et al. (2008), which developed and evaluated regime-dependent post-processing techniques for ensemble forecasting. That study used the University of Washington Mesoscale Ensemble, a varied-model (differing physics and parameterization schemes), multi-analysis ensemble with eight members, to analyze several post-processing methods for forecasting two-meter temperature. A large amount of these data was missing, and Greybush et al. (2008) used case deletion in the training and analysis of the post-processing methods. The performance of the post-processing methods was tested on a year of data for four locations in the Pacific Northwest.

The first post-processing method used here is a performance-weighted average using a 10-day sliding window. The relative performance of each of the eight ensemble members over the previous 10 days is used to compute forecast weights. Although this is not directly a regime-dependent forecast method, a seven-to-ten-day period reflects the typical persistence of atmospheric flow regimes in the Pacific Northwest (Greybush et al. 2008).

The second post-processing method used here is K-means regime clustering. Clustering techniques are applicable when the instances are to be divided into natural, or mutually exclusive, groups (Witten and Frank 2005); the parameter K is the number of clusters sought. Greybush et al. (2008) showed that the K-means regime clustering post-processing method produced the lowest MAE in two-meter temperature forecasts when compared with regime regression, a genetic algorithm regime method, and several windowed performance-weighted averages.

The goal of the current research is to determine the effects of replacing the missing data on these post-processing methods. The same dataset and model-development code are used to isolate the effects of replacing the missing data. We test the data imputation methods for four point locations for which we have verification observations. Section 2 describes the methods used to replace the missing data. Section 3 shows the results of the various replacement methods on the 10-day performance window post-processing technique and the K-means regime clustering method. Section 4 summarizes and analyzes the results.

2. METHODS

This study uses data from an eight-member ensemble of daily 48-hour surface temperature forecasts for one year (365 consecutive days). Four point locations in the Pacific Northwest are used: Portland, Oregon; Astoria, Oregon; Olympia, Washington; and Redmond, Washington. All of the verification data are available. However, 260 of the 2920 (8.9%) temperature forecasts are missing from the eight-member ensemble, and 151 of the 365 days (41.4%) have at least one ensemble member missing. Although there […] to be random. The cases with missing data were deleted in the original research (Greybush et al. 2008); more sophisticated methods are tested here.

The missing data are replaced before the simple linear bias correction is performed. The full process of missing data imputation, bias correction, and regime-based performance-weighted averaging is shown as a data flowchart in Figure 1.

Figure 1. Flowchart of the process of imputing the missing data and the effects on the MAE of statistical post-processing techniques.

After bias-correcting the individual ensemble members with simple linear regression, weights are assigned to the ensemble members. The 10-day performance-weighted window and K-means regime clustering each calculate separate performance weights for the ensemble members. The deterministic forecast is a combination of the weighted ensemble members and is compared with the verifying observations to produce a consensus-forecast MAE for each day. For a fair assessment of the methods, independent verification data are necessary; therefore, the data are randomly divided into a dependent (training) dataset (2/3 of the days) and an independent (verification) dataset (1/3 of the days). The performance weights are calculated on the training dataset and applied to the independent dataset to produce the mean MAE of the forecasts from the independent dataset. This process is repeated 50 times and the results averaged to avoid overfitting and biasing the results via sampling error.

Finally, the performance of the two weighted-average post-processing techniques is evaluated to quantify the benefit of missing data imputation. The post-processing techniques calculate averaging weights for each ensemble member based upon that member's forecasting performance in a 10-day window or under a specific weather regime. Therefore, the data
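To make regression-based imputation concrete, the following is a minimal sketch of filling a missing ensemble-member forecast by regressing it on a highly correlated companion series, in the spirit of the between-station regression of Kemp et al. (1983). The function name and toy data are illustrative, not the study's actual code or dataset.

```python
import numpy as np

def regress_impute(target, predictor):
    """Fill gaps in `target` by linear regression on `predictor`.

    Both arrays hold daily forecasts; NaN marks missing values.
    Illustrative sketch only -- not the study's implementation.
    """
    target = np.asarray(target, dtype=float)
    predictor = np.asarray(predictor, dtype=float)
    ok = ~np.isnan(target) & ~np.isnan(predictor)
    slope, intercept = np.polyfit(predictor[ok], target[ok], 1)
    filled = target.copy()
    gaps = np.isnan(target) & ~np.isnan(predictor)
    filled[gaps] = slope * predictor[gaps] + intercept
    return filled

# Toy example: member B tracks member A with a +1 degree offset.
a = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
b = np.array([11.0, 13.0, np.nan, 17.0, 19.0])
print(regress_impute(b, a))  # the NaN gap is filled near 15.0
```

Unlike substituting a climatological mean, this preserves the member's own bias and variability, which matters when the imputed forecast later feeds a performance-weighting scheme.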
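The simple linear bias correction of an individual ensemble member amounts to fitting a regression of observations on past forecasts and applying the fitted line to new forecasts. A minimal sketch (variable names are ours):

```python
import numpy as np

def linear_bias_correct(train_fcst, train_obs, new_fcst):
    """Fit obs ~ slope * fcst + intercept on training days,
    then apply the fit to correct new forecasts. Sketch only."""
    slope, intercept = np.polyfit(train_fcst, train_obs, 1)
    return slope * np.asarray(new_fcst, dtype=float) + intercept

# Toy data: this member runs 2 degrees too warm.
fcst = np.array([12.0, 15.0, 18.0, 21.0])
obs = fcst - 2.0
print(linear_bias_correct(fcst, obs, [20.0]))  # corrected to about 18.0
```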
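The 10-day sliding-window weighting can be sketched as follows: score each member by its MAE over the previous 10 days, convert skill to a normalized weight, and combine the current forecasts. Inverse-MAE weighting is an assumption made here for illustration; the study's exact weighting formula may differ.

```python
import numpy as np

def performance_weighted_forecast(past_fcsts, past_obs, today_fcsts):
    """Combine ensemble members using inverse-MAE weights.

    past_fcsts  : (window_days, n_members) member forecasts (e.g. 10 days)
    past_obs    : (window_days,) verifying observations
    today_fcsts : (n_members,) current forecasts to combine
    """
    mae = np.mean(np.abs(past_fcsts - past_obs[:, None]), axis=0)
    weights = 1.0 / (mae + 1e-6)   # more skillful members get more weight
    weights /= weights.sum()       # normalize weights to sum to one
    return float(np.dot(weights, today_fcsts))

# Toy 10-day window with three members of varying skill.
rng = np.random.default_rng(0)
obs = 15.0 + rng.normal(0.0, 1.0, size=10)
members = np.stack([obs + rng.normal(0.0, 0.2, 10),   # skillful
                    obs + rng.normal(0.0, 1.0, 10),   # moderate
                    obs + 3.0], axis=1)               # strongly biased
consensus = performance_weighted_forecast(members, obs, np.array([15.2, 15.5, 18.0]))
print(consensus)  # pulled toward the skillful members, near 15.5
```

This also shows why a missing member is disruptive: with a gap in `past_fcsts`, its window MAE, and hence its weight, is undefined unless the gap is imputed.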
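K-means regime clustering assigns each day to one of K mutually exclusive groups; performance weights can then be computed separately within each regime. A self-contained k-means sketch on toy two-dimensional regime features (the features and data are invented for illustration, not the study's predictors):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each day to its nearest centroid, then
    move each centroid to the mean of its assigned days. Sketch only."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):   # keep empty clusters where they are
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated synthetic "regimes" in feature space.
rng = np.random.default_rng(1)
regime_a = rng.normal([5.0, 80.0], 1.0, size=(20, 2))
regime_b = rng.normal([20.0, 30.0], 1.0, size=(20, 2))
days = np.vstack([regime_a, regime_b])
labels, cents = kmeans(days, k=2)
# Days drawn from the same synthetic regime end up sharing a label.
```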
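The evaluation protocol described above, a random 2/3 training / 1/3 verification split repeated 50 times with the resulting MAEs averaged, can be sketched as below. In the real workflow the weights would be refit on every training split; this sketch only illustrates the splitting and averaging step.

```python
import numpy as np

def repeated_split_mae(errors_by_day, n_repeats=50, train_frac=2/3, seed=0):
    """Average verification-set MAE over repeated random splits.

    errors_by_day : (n_days,) absolute consensus-forecast errors.
    """
    rng = np.random.default_rng(seed)
    n = len(errors_by_day)
    n_train = int(round(train_frac * n))
    maes = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        verif = perm[n_train:]          # the independent 1/3 of the days
        maes.append(np.mean(errors_by_day[verif]))
    return float(np.mean(maes))

# Toy year of absolute errors; the repeated average smooths sampling noise.
errs = np.abs(np.random.default_rng(2).normal(0.0, 1.5, size=365))
result = repeated_split_mae(errs)
print(round(result, 2))
```

Averaging over many random splits, rather than scoring a single split, reduces the chance that one lucky or unlucky partition biases the comparison between imputation methods.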