
Generalizing Model Output Statistics With Recurrent Neural Networks

Joan Creus-Costa
Departments of Computer Science and Physics
Stanford University
[email protected]

Abstract

We present an architecture that uses a single neural network to learn a mapping between the raw outputs of a numerical weather prediction model and model output statistics of interest, in this case temperature and dewpoint temperature. Instead of learning a model for each of thousands of stations, we enable data reuse and better learning by using a single recurrent neural network that learns the biases from the errors of the model in the preceding days. This greater generalization allows faster adaptation to changes in NWP models and makes it easier to add new locations for which predictions of model output statistics are desired. The architecture achieves performance comparable to NOAA's state-of-the-art model output statistics, with root-mean-square errors of 1.88 and 1.70 degrees Celsius when predicting temperature and dewpoint 24 hours in advance.

1 Introduction

Global numerical weather prediction (NWP) models, such as the American GFS or the European ECMWF, produce outputs for various variables (such as humidity or temperature) on a coarse grid with a resolution of approximately 27 km. These raw forecasts, however, do not directly translate to weather forecasts as we know them. The issue with them is threefold: first, they do not capture all the variables humans care about (such as probability of rain, or surface-level quantities); second, they can have a variety of local and global biases and systematics; and third, their limited resolution means interpolation is required to produce predictions of meaningful quantities at the points of interest [1]. As a result, a post-processing step, called Model Output Statistics (MOS), is applied to the raw model grid's values to generate the desired quantities at the desired points. To generate this map from the numerical prediction to the refined, debiased forecast, a training set is gathered from ground truth data collected at weather stations around the country, and a regression is performed between the model outputs and the ground truth to learn the biases and corrections. Traditionally, these model output statistics take the form of a multivariate linear regression (with some papers trying ensemble probabilistic techniques [2], random forests, or neural networks [3]), usually applied independently to each weather station that has ground truth data.

An interesting observation is that, by the nature of what we are trying to learn (localized biases and local climatology [1]), the amount of training data that can be used for these models is fairly limited, because it scales merely with the cumulative time a given weather station has been operating, instead of with the total data collected, which scales with the product of time elapsed and number of stations. In other words, having to create different models for each station means that data cannot be reused, even though intuitively some low-level characteristics should be shared. On top of that, the amount of data that can be used for a given station is limited too: as the numerical weather model changes, the distribution of the errors shifts, which means the regression cannot use old datapoints [4].

From these observations we motivate the desire to create a single model that can perform the MOS corrections for any station, instead of on a per-station basis. The benefits, besides potentially better accuracy from the increased training set size, include the fact that MOS can adapt to changes in the NWP model much faster: if it previously took T days' worth of data to recalibrate each station's coefficients after a model change, with a unified system the data could theoretically be collected in T/N days (with N being the total number of stations, which is a few thousand). Realistically, this might mean that a few weeks suffice. In particular, this paper focuses on predicting temperature and dewpoint temperature 24 hours ahead, based on features from the raw GFS model run.

CS230: Deep Learning, Fall 2019, Stanford University, CA. (LaTeX template borrowed from NIPS 2017.)

Figure 1: Weather stations providing ground truth data, representing the intersection of ISD stations and MOS stations that had valid recorded measurements over recent years. The colors represent how they were used as part of RNN learning: training in blue, dev in red, and test in green. For the station-by-station tests, only the green stations were used.

2 Related work

Most of the previous literature has used traditional learning methods, such as multivariate linear regression. For instance, in [5], one equation per station per predictand is produced as the output of an algorithm that generates model output statistics for various air quality outputs. Another family of algorithms looks at ensemble outputs of the weather models and produces probability distributions, as in [2]. This paper considers the (simpler) problem of giving a single output rather than a full distribution.

Neural networks have been used for model output statistics before. In [6], one neural network is trained for each of 31 stations in order to predict temperature. Nine input features are normalized and fed to a single-hidden-layer neural network with logistic activation. The number of hidden units is varied, and the optimal number is found to range from zero to eight depending on the station. Another example, applied to solar radiation forecasts, is [7].

Perhaps the most relevant and interesting related work is the 2018 paper [3]. It creates ensemble forecasts using neural networks but, in a departure from the previous work mentioned earlier (which mostly relied on a simple single hidden layer), it uses a more modern architecture and trains a single big network. With a similar motivation of allowing station-specific information while training a single network, the paper uses an embedding, commonly found in natural language processing tasks, of size $n_{\mathrm{emb}} = 2$ to capture station-specific biases. That is, each of the 537 stations considered has two latent features that are learned and then fed into the neural network. By combining that with auxiliary features besides temperature, it was able to beat other benchmark model output statistics. This approach still uses stations as explicit inputs (in that the station selects which embedding is used), but it can use a much larger training corpus since it is otherwise a single network. In this work, we extend those ideas and remove the explicit station dependence by instead using a recurrent network that learns the biases from the previous few days.

In the interest of full disclosure, the author is not experienced in the field of meteorology and the above represents their best effort at understanding developments and trends in the field.

3 Dataset and Features

Building a dataset and benchmark for this task proved challenging, involving many distinct datasets that required extensive post-processing and error checking. Due to the large amount of data to process (over a terabyte compressed), we created a custom framework for downloading and processing large amounts of data on a single-node, 48-core, 96-thread machine. This framework scaled well and let us process all the required data in parallel and quickly, with caching proving particularly useful in the early stages of development. The datasets were mostly acquired from NOAA's servers, which hold historical data and some historical predictions, using about 600 lines of Python. They are listed below:

1. Truth measurements performed by weather stations across the United States. These are available sometimes going as far back as 1901, but only recent measurements (from 2014 on) were downloaded, since NWP models change every few years. These come from the Integrated Surface Database (ISD) [8].

2. GFS predictions on a grid. For each GFS prediction cycle (which runs four times a day, at 0, 6, 12, and 18 hours from midnight UTC), we download the forecasts every 3 hours up to 24 hours in advance, and we extract the variables deemed relevant for the task at hand. The list was crafted based on the features used by [3] in their model output statistics, and includes a total of 19 measurements for each point in the grid. These are: 2-meter temperature, 2-meter dewpoint temperature, 2-meter relative humidity, convective available potential energy, latent and sensible heat net flux; downward short-wave, downward long-wave, upward short-wave, and upward long-wave radiation fluxes; surface pressure, grid elevation, volumetric soil moisture at 0 meters depth, geopotential height at 500 and 850 hPa, and the u and v wind components at 500 hPa. On top of that, for the recurrent model, we add latitude, longitude, elevation, fraction of the year (0 on January 1 and 1 on December 31), and the previous error, explained in Section 4.3. A post-processing script extracts those variables and saves them to binary matrices (indexed by the filesystem), which allows for very quick retrieval of values using Linux memory maps supported by NumPy, with a custom function written to interpolate to a particular set of coordinates (a sketch of this access pattern follows the list). Due to limited data on the NOAA website, only forecasts after June 2018 could be downloaded. The author spent some time trying to download a longer historical set from NCEP's Research Data Archive but was not successful; using the more limited dataset, from the second half of 2018 onward, was deemed sufficient after early results.

3. The model output statistics based on GFS that NOAA itself generates using its own methods (NOAA-GFS-MOS). These are bulletins released every six hours that contain forecasts for the next few days at 3-hour intervals, and they represent decades of work on model output statistics. This is useful as a benchmark for the error between truth measurements and refined GFS forecasts; while it is not a lower bound on achievable error (which is limited by the skill of GFS), it serves as a reference for what can reasonably be achieved.

4. Maps between the codes used by NOAA-GFS-MOS to denote weather stations and some combination of two identifiers (sometimes missing altogether) used by the weather stations themselves when reporting truth data. This proved difficult given that several files had partial duplicates, missing data, or inconsistent labels, so heuristics were applied to find the most likely match.
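To make the access pattern concrete, below is a minimal sketch of what the memory-mapped retrieval and bilinear interpolation could look like. The file layout, grid dimensions, and 0.25-degree spacing are illustrative assumptions, not the actual format used.

```python
import numpy as np

N_FEATURES = 19           # the 19 GFS variables listed above
N_LAT, N_LON = 721, 1440  # assumed 0.25-degree global grid

def load_cycle(path):
    """Memory-map one forecast cycle's extracted features (no RAM copy)."""
    return np.memmap(path, dtype=np.float32, mode="r",
                     shape=(N_FEATURES, N_LAT, N_LON))

def interpolate(grid, lat, lon):
    """Bilinearly interpolate all features to a single (lat, lon) point."""
    i = (90.0 - lat) / 0.25   # fractional row index (row 0 at 90N)
    j = (lon % 360.0) / 0.25  # fractional column index (column 0 at 0E)
    i0, j0 = int(i), int(j)
    i1 = min(i0 + 1, N_LAT - 1)  # clamp at the pole
    j1 = (j0 + 1) % N_LON        # wrap around the date line
    di, dj = i - i0, j - j0
    return ((1 - di) * (1 - dj) * grid[:, i0, j0]
            + (1 - di) * dj * grid[:, i0, j1]
            + di * (1 - dj) * grid[:, i1, j0]
            + di * dj * grid[:, i1, j1])
```

Because the memory map only touches the pages around the four neighboring grid cells, interpolating a handful of stations never requires reading a full forecast file.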

Special care was taken in the split of the training, development, and testing sets. Because there are two different types of training (per-station networks and recurrent networks), two splits were considered, one based on time and one based on station; the real testing set is the intersection of the two splits, so that both tasks could be evaluated at the same times and in the same places (since the attainable error varies wildly in time and space). Of the 1452 stations with valid data, 85% were used for training (of the RNN model), 7.5% for dev, and 7.5% for test. This leaves a bit over a hundred stations for testing, which is enough to discern between different models while leaving enough to train station-agnostic methods. For non-RNN methods, where we develop a model for each station, we developed the model on each of the 7.5% of test stations, training on the first year of data, with the subsequent 2-month periods used for dev and test, respectively. The decision to split based on time, rather than randomly, was due to the temporal nature of the data and a desire to prevent even subtle forms of overfitting. Given that even a month of data on one station gives 2,880 datapoints, this was deemed sufficient, and it makes for over 1.4 million datapoints usable for RNN training. The station-based split can be seen in Figure 1, with the temporal split applying implicitly to each station; a sketch of the split logic follows.
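As a sketch of the two-way split under the fractions stated above (the station IDs, RNG seed, and 365/61-day window lengths are placeholder assumptions):

```python
import numpy as np

# Station split: 85% train / 7.5% dev / 7.5% test of the 1452 valid stations.
rng = np.random.default_rng(0)
stations = rng.permutation(np.arange(1452))  # stand-in for real station IDs
n_train = int(0.85 * len(stations))
n_dev = int(0.925 * len(stations))
train_stations = stations[:n_train]
dev_stations = stations[n_train:n_dev]
test_stations = stations[n_dev:]

def temporal_split(timestamps, t0):
    """Per-station temporal split: first year for training, then two
    consecutive 2-month windows for dev and test.
    timestamps: sorted array of np.datetime64; t0: first timestamp."""
    one_year = np.timedelta64(365, "D")
    two_months = np.timedelta64(61, "D")
    train = timestamps < t0 + one_year
    dev = (~train) & (timestamps < t0 + one_year + two_months)
    test = ~(train | dev)
    return train, dev, test
```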

4 Methods

4.1 Baseline model and benchmark

In order to get a rough idea of the range of attainable accuracy, we used a very simple baseline that consists of the raw outputs of the GFS model for temperature and dewpoint temperature, merely interpolated to the location of the station. The evaluation metric used for this baseline comparison, as well as the loss function (in its squared version), is the root-mean-square error, i.e. $\ell = \sqrt{(1/n) \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$. For simplicity, we focus on just two output statistics of common interest, temperature and dewpoint temperature, predicted 24 hours in advance. Results for this baseline can be seen in Table 1.

A helpful comparison is the accuracy obtained (using the same metric) by NOAA's own output statistics service, i.e. the NOAA-GFS-MOS described above. Results are posted in the same table, but should be interpreted as close to the best one can hope to get given the underlying model, since they represent decades' worth of work at NOAA by experts in the field with per-station models.
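A direct implementation of the root-mean-square metric above:

```python
import numpy as np

def rmse(y, y_hat):
    """Root-mean-square error between observations y and predictions y_hat."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sqrt(np.mean((y - y_hat) ** 2))

print(rmse([20.0, 21.0, 19.5], [21.0, 20.5, 19.0]))  # -> ~0.707
```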

Figure 2: Recurrent neural network model used. Inputs $(x_t, \Delta_{t-1})$ feed a chain of two-layer RNN cells ("2-RNN") with hidden states $h_1, \dots, h_{T-1}$, producing outputs $y_1, \dots, y_T$.

4.2 Individual neural networks

We then examined the possibility of training one neural network for each of the stations in the test set, which has the drawback of only allowing a limited training set for each station. However, it gives valuable information on the best we can do with station-specific models and neural networks (compared to NOAA's own model output statistics), as well as providing guidance on how to train neural networks in this scenario. The final architecture consisted of a one-hidden-layer neural network with ReLU activation for the hidden layer and two outputs, one for temperature and one for dewpoint temperature. That is,

$y = W_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2$, with $y \in \mathbb{R}^2$, $x \in \mathbb{R}^{19}$, $W_1 \in \mathbb{R}^{20 \times 19}$, and matching dimensions for the rest. The choice of the number of layers and number of hidden nodes is explained in the section below.
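Since Section 5 reports that Scikit-Learn was used for these per-station models, a minimal sketch might look as follows; the exact estimator settings and the placeholder data are assumptions rather than the author's code.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder data standing in for one station's standardized GFS features
# (19 columns) and its observed (temperature, dewpoint) pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 19))
Y = rng.normal(size=(1000, 2))

# One hidden layer of 20 ReLU units with two linear outputs, matching
# W1 in R^{20x19} above; trained with the full-batch L-BFGS solver.
model = MLPRegressor(hidden_layer_sizes=(20,), activation="relu",
                     solver="lbfgs", max_iter=500)
model.fit(X, Y)
preds = model.predict(X[:5])  # shape (5, 2): temperature and dewpoint
```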

4.3 Recurrent neural networks

A more complex model considered is motivated by the idea that biases ought to be somewhat temporally stable; in other words, if the raw model underpredicted the temperature at Stanford by one degree today, it is likely to also underpredict it tomorrow, since it makes more sense for the biases to be a function of space and geography rather than time. As a result, the residuals of the model in previous days should intuitively make good inputs to the model, which gives rise to a plausible recursive model. We initialize the "encoded biases" $\Delta \in \mathbb{R}^2$ as $\Delta_0 = 0$ and an internal state $h \in \mathbb{R}^H$ as $h_0 = 0$, and consider a recursive function that propagates information about the biases while propagating information to estimate temperature and dewpoint temperature:

$h_t = \mathrm{RNN}_L(h_{t-1}, x_t, \Delta_{t-1})$

From the hidden state we can make a prediction of temperature and dewpoint with a fully connected neural network: $\hat{y}_t = \mathrm{FC}_{L'}(h_t)$. From that we are able to provide an additional estimate of the bias to the next iteration of the recursive application, i.e. $\Delta_t = y_t - \hat{y}_t$. For a number $T$ of steps of previous predictions¹, we run this recurrent neural network with a fully connected output layer, and output the final predicted temperature and dewpoint of interest at the end.

Besides the various hyperparameters, an architectural decision worth discussing is the loss function. We can consider either the loss at the end of the RNN sequence (which most closely resembles the loss function used earlier), or we can add an auxiliary term corresponding to the loss of the intermediate outputs. In other words, instead of (considering just one of the outputs for simplicity) $(y_t - \hat{y}_t)^2$, we can consider $(1/t) \sum_{i=1}^{t} (y_i - \hat{y}_i)^2$. Intuitively, this makes learning easier by providing auxiliary outputs that keep a short path between the loss and the values, rather than suffering from potential vanishing gradients, a common problem in vanilla RNNs [9].

Using more advanced recurrent architectures, such as LSTMs, did not result in big improvements, due to trouble accurately reconstructing the wide range of temperatures after going through a tanh layer with a limited output range (from −1 to 1). Rather, using a simple RNN with a ReLU activation function gave the best results, due to the ease of learning a mapping that mostly preserves the first two (baseline) features and adds corrections on top. The equation used, as in [10], is two layers of:

$h_t = \mathrm{ReLU}(W_{ih} x_t + W_{hh} h_{t-1} + b_{hh})$

¹We make sure not to violate causality: we only use previous steps for which we have full information, so the last step uses the prediction made 24 hours ago about what the current temperature will be.
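Putting the pieces of this section together, below is a minimal PyTorch sketch of the recurrent model: residual feedback $\Delta_t$, a two-layer ReLU RNN, and a fully connected head. The explicit per-step loop and all variable names are assumptions made for clarity, not the author's exact code; the feature count (19 GFS variables plus latitude, longitude, elevation, and fraction of year) and hidden sizes follow the text.

```python
import torch
import torch.nn as nn

N_FEATURES = 23  # 19 GFS variables + lat, lon, elevation, year fraction
N_OUT = 2        # temperature and dewpoint temperature

class MOSRNN(nn.Module):
    def __init__(self, hidden=50):
        super().__init__()
        # Two stacked vanilla RNN layers with ReLU, as in the equation above.
        self.rnn = nn.RNN(N_FEATURES + N_OUT, hidden, num_layers=2,
                          nonlinearity="relu", batch_first=True)
        # Fully connected head with one hidden layer (L' = 2 in the text).
        self.fc = nn.Sequential(nn.Linear(hidden, 50), nn.ReLU(),
                                nn.Linear(50, N_OUT))

    def forward(self, x, y):
        # x: (batch, T, N_FEATURES) inputs; y: (batch, T, N_OUT) observed
        # truths, used only to form the residuals Delta_t = y_t - y_hat_t.
        batch, T, _ = x.shape
        h = torch.zeros(2, batch, self.rnn.hidden_size)  # h_0 = 0
        delta = torch.zeros(batch, N_OUT)                # Delta_0 = 0
        preds = []
        for t in range(T):
            step = torch.cat([x[:, t], delta], dim=1).unsqueeze(1)
            out, h = self.rnn(step, h)
            y_hat = self.fc(out[:, 0])
            preds.append(y_hat)
            delta = y[:, t] - y_hat  # residual fed to the next step
        return torch.stack(preds, dim=1)  # (batch, T, N_OUT)

model = MOSRNN()
x = torch.randn(8, 24, N_FEATURES)  # batch of 8 stations, T = 24 steps
y = torch.randn(8, 24, N_OUT)
preds = model(x, y)
loss_final = ((preds[:, -1] - y[:, -1]) ** 2).mean()  # end-of-sequence loss
loss_aux = ((preds - y) ** 2).mean()                  # auxiliary-loss variant
```

Note that only the final step's prediction is the forecast of interest; the residual formed after the last step is never used, preserving the causality constraint in the footnote.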

Method                              Temperature RMS   Dewpoint RMS
Baseline (raw GFS)                  2.48              2.33
NOAA-GFS-MOS                        1.88              1.77
Individual neural networks          1.92              1.94
Single recurrent network:
  T = 0                             2.10              2.05
  T = 4                             1.88              1.74
  T = 12                            1.92              1.72
  T = 12, auxiliary loss            1.91              1.75
  T = 24                            1.91              1.71
  T = 24, auxiliary loss            1.91              1.70

Table 1: Comparison of errors against ISD measurements from 109 stations, with approximately 51,000 measurements contributing to each estimate of RMS error. The same sites and time intervals were used across experiments.

5 Results and discussion

Results are summarized in Table 1.

For training the individual neural networks, Scikit-Learn was used [11] due to the simplicity of the architecture. An informal architecture search determined that adding a second layer did not reduce errors and only increased overfitting, as did adding more hidden nodes. Training was relatively straightforward and fast with a learning rate $\alpha = 10^{-3}$ and (due to the small size of the network) the L-BFGS optimizer. It was very important to standardize the input values, since they spanned many orders of magnitude (for instance, pressure, in pascals, is on the order of $10^5$ at sea level). This was done by preprocessing all data based on the mean and standard deviation of the training set on each column, except temperature and dewpoint temperature. Since those variables were (as seen in the baseline) very close to the output we wanted, and we merely wanted to apply corrections to them, they were kept unscaled and unshifted (a sketch of this scaling scheme appears at the end of this section).

For the recurrent network, PyTorch [10] was used. The Adam optimizer [12] was used with default parameters except $\alpha = 10^{-3}$, and a batch size of 128. The algorithm was found not to be very sensitive to those parameters, learning similarly in each case. The number of layers and hidden states of the RNN were chosen to be $L = 2$, $H = 50$, due to slightly better performance on the dev set than with fewer, and no gains with more. Similarly, the fully connected layer was chosen to have $L' = 2$ and 50 hidden units. For training the recurrent network, using the auxiliary loss greatly helped with training speed, but did not result in significantly better evaluation metrics, as we can see in Table 1.

Overall, the results for the proposed algorithms are significantly better than the baseline, with performance roughly comparable to the model output statistics generated by NOAA. In particular, training individual neural networks for each station performs worse than training a single recurrent network that applies to all stations; we see that using multiple time steps ($T \neq 0$) greatly helps compared to $T = 0$ (which represents a single fully connected layer shared by all stations), which means that the network is in fact learning from the residuals of previous time steps. As we sweep through $T$, we see that even small values suffice, which means it is possible to make accurate predictions for a new location without requiring years' worth of training.
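For reference, the standardization scheme described above (scale every column by training-set statistics, but leave the two baseline columns untouched) can be sketched as follows; the column indices are placeholder assumptions.

```python
import numpy as np

KEEP_RAW = [0, 1]  # assumed positions of 2 m temperature and dewpoint

def fit_scaler(X_train):
    """Compute per-column mean/std on the training set, with an identity
    transform for the two baseline columns left unscaled and unshifted."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    mu[KEEP_RAW], sigma[KEEP_RAW] = 0.0, 1.0
    return mu, sigma

def transform(X, mu, sigma):
    return (X - mu) / sigma
```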

6 Conclusions and future work

We have presented an algorithm to produce model output statistics that does not have an explicit dependence on specific weather stations and has performance comparable to state-of-the-art methods, as well as significantly better performance than the baseline. Instead of requiring a separate set of equations for each site, as some previous work does, a single model is able to learn the strongly temporally correlated biases by using a recurrent network that tracks the bias over time. Adding auxiliary losses can significantly speed up training but does not ultimately yield a better model. A consequence of the feasibility of building a model like the one demonstrated here is that it becomes possible to adapt to a new NWP model much faster, by allowing for greater data reuse, and it becomes much easier to add a new weather station for which we want to generate predictions, e.g. in a new town.

An interesting problem left for future work is, rather than learning biases temporally, tackling the problem of learning biases spatially. If we assume that the biases between the NWP model and the observed values are a function of the local geography around the station, then feeding the surrounding elevation map and land use map to a convolutional network might let it learn that relation; e.g. being on a hill next to a river might induce a negative bias in the temperature. Of course, the best architecture would probably mix a temporal component powered by a recurrent network and a spatial one powered by a convolutional one.

References

[1] William H. Klein and Harry R. Glahn. Forecasting local weather by means of model output statistics. Bulletin of the American Meteorological Society, 55(10):1217–1227, 1974.

[2] Tilmann Gneiting, Adrian E. Raftery, Anton H. Westveld III, and Tom Goldman. Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review, 133(5):1098–1118, 2005.

[3] Stephan Rasp and Sebastian Lerch. Neural networks for postprocessing ensemble weather forecasts. Monthly Weather Review, 146(11):3885–3900, 2018.

[4] Mary C. Erickson, J. Brent Bower, Valery J. Dagostaro, J. Paul Dallavalle, Eli Jacks, John S. Jensenius Jr., and James C. Su. Evaluating the impact of RAFS changes on the NGM-based MOS guidance. Weather and Forecasting, 6(1):142–147, 1991.

[5] Stavros Antonopoulos, Pierre Bourgouin, Jacques Montpetit, and Gerard Croteau. Forecasting O3, PM2.5 and NO2 hourly spot concentrations using an updatable MOS methodology. In Air Pollution Modeling and its Application XXI, pages 309–314. Springer, 2011.

[6] Caren Marzban. Neural networks for postprocessing model output: ARPS. Monthly Weather Review, 131(6):1103–1111, 2003.

[7] Philippe Lauret, Hadja Maïmouna Diagne, and Mathieu David. A neural network post-processing approach to improving NWP solar radiation forecasts. 2014.

[8] Adam Smith, Neal Lott, and Russ Vose. The Integrated Surface Database: Recent developments and partnerships. Bulletin of the American Meteorological Society, 92(6):704–708, 2011.

[9] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.

[10] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.

[11] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

[12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
