Generalizing Model Output Statistics With Recurrent Neural Networks

Joan Creus-Costa
Departments of Computer Science and Physics
Stanford University
[email protected]

Abstract

We present an architecture that uses a single neural network to learn a mapping between the raw outputs of a numerical weather prediction model and model output statistics of interest, in this case temperature and dewpoint temperature. Instead of learning a model for each of thousands of weather stations, we enable data reuse and better learning by using a single recurrent neural network that learns the biases from the model's errors on the previous days. This greater generalization enables better adaptation to changes in NWP models and makes it easier to add new locations for which predictions of model output statistics are desired. The network achieves performance comparable to NOAA's state-of-the-art model output statistics, with root-mean-square errors of 1.88 and 1.70 degrees Celsius in predicting temperature and dewpoint 24 hours in advance.

1 Introduction

Global numerical weather prediction models, such as the American GFS or the European ECMWF, produce outputs for various variables—such as humidity or temperature—on a coarse grid with a resolution of approximately 27 km. These raw forecasts, however, do not directly translate into weather forecasts as we know them. The issue with them is threefold: first, they do not capture all the variables humans care about (such as probability of rain, or surface-level quantities); second, they can have a variety of local and global biases and systematics; and third, their limited resolution means interpolation is required to produce predictions of meaningful quantities at the points of interest [1]. As a result, a post-processing step, called Model Output Statistics (MOS), is applied to the raw model grid's values to generate the desired quantities at the desired points.

To generate this map from the numerical prediction to the refined, debiased forecast, a training set is gathered from ground truth data collected at weather stations around the country, and a regression is then performed between the model outputs and the ground truth to learn the biases and corrections. Traditionally, these model output statistics take the form of a multivariate linear regression (with some papers trying ensemble probabilistic techniques [2], random forests, or neural networks [3]), usually applied independently to each weather station that has ground truth data; a minimal sketch of this per-station baseline appears at the end of this section.

An interesting observation is that, by the nature of what we are trying to learn—localized biases and local climatology [1]—the amount of training data available to such models is fairly limited: it scales merely with the cumulative time a given weather station has been operating, rather than with the total data collected, which scales with the product of time elapsed and number of stations. In other words, having to create a different model for each station means that data cannot be reused, even if intuitively it might make sense that some low-level characteristics are shared. On top of that, the amount of data usable for a given station is limited too: as the numerical weather model changes, the distribution of the errors shifts over time, which means that the regression cannot use old datapoints [4].
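For concreteness, the following is a minimal sketch of that traditional per-station baseline: one independent multivariate linear regression per station, fit from NWP features to observed truth. This is our own illustration of the general technique, not NOAA's actual MOS code; the function names and array shapes are assumptions.

import numpy as np

def fit_station_mos(X, y):
    """Classical per-station MOS: multivariate linear regression from NWP
    output features X, shape (n_days, n_features), to observed truth y,
    shape (n_days,). Returns coefficients plus an intercept term."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
    coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return coef

def predict_station_mos(coef, X):
    """Apply a fitted per-station regression to new NWP features."""
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ coef

# One independent regression per station, so training data cannot be shared:
# coefs = {sid: fit_station_mos(X_by[sid], y_by[sid]) for sid in station_ids}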
From these observations we motivate the desire to create a single model that can perform the MOS corrections for any station, instead of operating on a per-station basis. The benefits, besides potentially better accuracy coming from the increased training set size, include the fact that MOS can adapt to changes in the NWP model much faster: if it previously took T days' worth of data to recalibrate each station's coefficients after a model change, with a unified system the data could theoretically be collected in T/N days (with N being the total number of stations, which is a few thousand). Realistically, this might mean that a few weeks of data suffice. In particular, this paper focuses on predicting temperature and dewpoint temperature 24 hours ahead, based on features from the raw GFS model run.

Figure 1: Weather stations providing ground truth data, representing the intersection of ISD stations and MOS stations that had valid recorded measurements over recent years. The colors represent how they were used as part of RNN learning: training in blue, dev in red, and test in green. For the station-by-station tests, only the green stations were used.

2 Related work

Most of the previous literature has used traditional learning methods, such as multivariate linear regression. For instance, in [5], one equation per station per predictand is produced as the output of an algorithm that generates model output statistics for various air quality outputs. Another family of algorithms looks at ensemble outputs of the weather models and produces probability distributions, as in [2]. This paper considers the (simpler) problem of giving a single output rather than a full distribution.

Neural networks have been used for model output statistics before. In [6], one neural network is trained for each of 31 stations in order to predict temperature. Nine input features are normalized and fed to a single-hidden-layer neural network with logistic activation. The number of hidden units is varied, and the authors find that the optimal number ranges from zero to eight depending on the station. Another example, applied to solar radiation forecasts, is [7].

Perhaps the most relevant and interesting related work is the 2018 paper in [3]. It creates ensemble forecasts using neural networks, but in a departure from the previous work mentioned above (which mostly relied on a single hidden layer), it uses a more modern architecture and trains a single large network. With a similar motivation of allowing station-specific information while training a single network, the paper uses an embedding—commonly found in natural language processing tasks—of size n_emb = 2 to capture station-specific biases (sketched below). That is, each of the 537 stations considered has two latent features that are learned and then fed into the neural network. By combining that embedding with auxiliary features beyond temperature, the model was able to beat other benchmark model output statistics. This interesting approach still uses stations as explicit inputs (in that the station selects which embedding is used), but can draw on a much larger training corpus since it is otherwise a single network. In this work, we extend those ideas and remove the explicit station dependence by instead using a recurrent network that learns the biases from the previous few days.
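To make the embedding approach concrete, here is a minimal sketch in PyTorch of a network in the spirit of [3]; the framework, hidden-layer size, and single-predictand output are our assumptions for illustration, not details of the original paper.

import torch
import torch.nn as nn

class EmbeddingMOS(nn.Module):
    """Station-embedding MOS network in the spirit of [3]: each station gets
    n_emb learned latent features, concatenated with the NWP features."""
    def __init__(self, n_stations=537, n_emb=2, n_features=19, n_hidden=32):
        super().__init__()
        self.station_emb = nn.Embedding(n_stations, n_emb)
        self.net = nn.Sequential(
            nn.Linear(n_features + n_emb, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 1),  # single predictand, e.g. 2-m temperature
        )

    def forward(self, station_id, features):
        # station_id: (batch,) int64; features: (batch, n_features) float32
        emb = self.station_emb(station_id)
        return self.net(torch.cat([features, emb], dim=-1)).squeeze(-1)

# Usage: predict for a batch of 8 (station, features) pairs.
model = EmbeddingMOS()
ids = torch.randint(0, 537, (8,))
x = torch.randn(8, 19)
y_hat = model(ids, x)  # shape (8,)

The key design point is that the station enters only through the two learned latent features, so all stations share one set of network weights and hence one training corpus.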
In the interest of full disclosure, the author is not experienced in the field of meteorology, and the above represents their best effort at understanding developments and trends in the field.

3 Dataset and Features

Building a dataset and benchmark for this task proved challenging, involving several distinct datasets that required extensive post-processing and error checking. Because of the large amount of data to process—over a terabyte compressed—we created a custom framework for downloading and processing it on a single-node, 48-core, 96-thread machine. This framework scaled well and let us process all the required data in parallel and quickly, with caching proving particularly useful in the early stages of development. The datasets were acquired mostly from NOAA's servers, which host historical data and some historical predictions, using about 600 lines of Python. They are listed below:

1. Ground truth measurements performed by weather stations across the United States. These are available in some cases going back as far as 1901, but only recent measurements (from 2014 on) were downloaded, since NWP models change every few years. These come from the Integrated Surface Database (ISD) [8].

2. Global Forecast System predictions on a grid. For each GFS prediction cycle (which runs four times a day, at 0, 6, 12, and 18 hours after midnight UTC), we download the forecasts every 3 hours up to 24 hours in advance, and we extract the variables deemed relevant for the task at hand. The list was crafted based on the features used by [3] in their model output statistics and includes a total of 19 measurements for each point in the grid. These are: 2-meter temperature, 2-meter dewpoint temperature, 2-meter relative humidity, convective available potential energy, latent and sensible heat net flux; downward short-wave, downward long-wave, upward short-wave, and upward long-wave radiation fluxes; surface pressure, grid elevation, volumetric soil moisture at 0 meters depth, geopotential height at 500 and 850 hPa, and the u and v components of wind at 500 hPa. On top of these, for the recurrent model, we add latitude, longitude, elevation, fraction of the year (0 on January 1 and 1 on December 31), and the previous error, explained in Section 4.3.

A post-processing script extracts those variables and saves them to binary matrices (indexed by the filesystem), which allows very quick retrieval of values using Linux memory maps supported by NumPy, with a custom function written to interpolate to a particular set of coordinates; a sketch of this retrieval path is given below. Due to limited data on the NOAA website, only forecasts after June 2018 could be downloaded. The author spent some time trying to download a longer historical set from NCEP's Research Data Archive but was not successful; using the more limited dataset, from the second half of 2018 on, was deemed sufficient after early results.
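As an illustration of that memory-mapped retrieval path, here is a minimal sketch; the file layout, grid dimensions, and 0.25-degree spacing are assumptions made for the example, not a description of the actual framework.

import numpy as np

# Assumed layout: one float32 matrix per (cycle, variable), shape (n_lat, n_lon),
# on a regular lat/lon grid. np.memmap lets the OS page in only the bytes touched.
N_LAT, N_LON = 721, 1440             # assumed 0.25-degree global grid
LAT0, STEP = 90.0, 0.25              # grid origin and spacing (assumed)

def load_grid(path):
    """Memory-map one stored variable without reading the whole file."""
    return np.memmap(path, dtype=np.float32, mode="r", shape=(N_LAT, N_LON))

def interp(grid, lat, lon):
    """Bilinear interpolation of a gridded field to one (lat, lon) point."""
    i = (LAT0 - lat) / STEP           # fractional row index (latitude decreasing)
    j = (lon % 360.0) / STEP          # fractional column index
    i0, j0 = int(i), int(j)
    di, dj = i - i0, j - j0
    i1, j1 = min(i0 + 1, N_LAT - 1), (j0 + 1) % N_LON
    return ((1 - di) * (1 - dj) * grid[i0, j0] + (1 - di) * dj * grid[i0, j1]
            + di * (1 - dj) * grid[i1, j0] + di * dj * grid[i1, j1])

# Example with a hypothetical file path (2-m temperature, 24-hour forecast):
# t2m = load_grid("gfs/2018062500/f024/t2m.bin")
# print(interp(t2m, 37.43, -122.17))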