Research Collection
Master Thesis
Structured Attention Transformers on Weather Prediction
Author(s): Ernst, Lukas
Publication Date: 2021
Permanent Link: https://doi.org/10.3929/ethz-b-000483966
Rights / License: In Copyright - Non-Commercial Use Permitted
Structured Attention Transformers on Weather Prediction
Master Thesis
Lukas Ernst
Scalable Parallel Computing Laboratory ETH Zürich
Advisors: Nikoli Joseph Dryden, Tal Ben-Nun
Supervisors: Prof. Dr. Torsten Hoefler
May 5, 2021

Abstract
Having accurate weather forecasts is of great importance. Not only do they influence our daily decision making, but they also affect our social lives, well-being and even our economy. Take hurricanes as an example: they are responsible for damage amounting to billions of dollars and can even threaten people's lives. Current state-of-the-art methods rely on physics simulations that solve complex partial differential equations. However, solving such equations is time-consuming and requires several hours, even on modern supercomputers.
Deep learning models take some time to train but can produce predictions in a matter of minutes. Recent advancements in weather forecasting using deep learning have shown large improvements in performance, reaching parity with NWP models run at the same resolution. This thesis focuses on one such advancement, the Transformer architecture, which originally raised the bar in Natural Language Processing. This architecture has the drawback that it must flatten its input, which breaks the spatial relations in the data. Also, Transformers are permutation invariant, which is not desirable for spatial tasks such as weather prediction.
This thesis presents a novel model architecture consisting of an optional ResNet backbone followed by a Transformer based on axial attention and efficient global self-attention. Additionally, we introduce a Transformer-only model that completely replaces convolutions with attention blocks. Our main contribution is to preserve the spatial structure of the data during self-attention. By adding two different optional weighting layers after the attention block, we try to help the model guide its attention and reduce the emphasis on the earth’s poles. All the investigated models predict geopotential at 500hPa, temperature at 850hPa and the temperature two meters above ground all at once in a continuous setting. We evaluate our models on ERA5 data and pick three candidates to compare on extreme weather events.
We show that adding a Transformer on top of a ResNet backbone only slightly increases performance in the mean, and that the Transformer-only models perform worse than or similarly to other ResNets. Our best model improves on the prior work by Peter Tatkowski [1] by at least 18% on all output variables. It also marginally outperforms the direct ResNet19 (unpretrained, ERA5-only) by Rasp and Thuerey [2] for all lead times and beats their continuous ResNet19 (unpretrained, ERA5-only) for 3 days lead time. We find that the Transformer-only models are very close to our best model in terms of performance. Further investigations of the attention heads, attention maps and saliency maps provide deeper insight into the Transformer's contributions to the ResNet and reveal a peculiar focus on Antarctica and on regions of major global ocean currents.
Acknowledgements
Errors like RuntimeError: shape '[-1, 400]' is invalid for input of size 384, or questions like “Do our predictions even make sense?” and “Are the normalizations for the data broken?”, were constant companions during the development of this thesis. I often looked at various plots and thought to myself “What is the model even doing?”. This state of mind is perfectly visualized in the xkcd comic in Figure 1. But exactly these are the exciting parts: pushing ahead into more or less unknown territory on a subject that concerns everyone.
I’m very grateful to have been granted the opportunity to write my master thesis in the SPCL group under Professor Torsten Hoefler. I also want to thank my two advisors, Nikoli Dryden and Tal Ben-Nun, who always helped me out, discussed open questions with me and nudged me in the right direction. They also provided me with the initial set of papers to get me up to speed in recent advancements and related topics.
The appearance of COVID-19 has not made things any easier. Staying at home and the limited variety in daily life were tough and did not help in drawing extra motivation at times. My gratitude goes to my friends, family, and partner for always having an open ear, for being absolute motivators in tough times, and for their patience with my occasional technical talks.
Figure 1: Figure from xkcd - A webcomic of romance, sarcasm, math, and language. [3].
Contents
Abstract
Acknowledgements
1 Introduction
2 Transformers and Attention
2.1 What is Attention?
2.2 Scaled Dot-Product Attention
2.3 Multiple Heads
2.4 Self-Attention Transformer
2.4.1 Self-Attention
2.4.2 Feed Forward Network
2.4.3 Residual Connection & Normalization
3 Structured Attention
3.1 Axial Attention
3.2 Global Self-Attention (GSA) Module
3.2.1 Content Attention
3.2.2 Positional Attention
4 Data and Baselines
4.1 ERA5
4.2 WeatherBench
4.2.1 Baselines
4.3 Normalization
4.4 Data Subset
4.5 Transformations
4.6 Known Dataset Inconveniences
5 Models
5.1 General Network Architecture
5.2 GSA-(Res)Net Forecasters
5.2.1 Adapted ResNet Block
5.2.2 GSA Block
6 Experiments on WeatherBench
6.1 Setup
6.2 Predictions
6.3 Training
6.4 Results
6.4.1 Relative Positional Embeddings
6.4.2 First Indicators on Future Ideas
6.5 Extreme Weather Events
6.5.1 Storm of the Century (1993)
6.5.2 Hurricane Katrina (2005)
6.5.3 Cyclone Emma (2008)
6.6 Sensitivity Analysis
6.6.1 Variable Lead Time
6.6.2 Behaviour under Perturbations
6.7 Discussion
6.7.1 Performance
6.7.2 Lacking Data?
6.7.3 Attention Heads & Affine Layers
6.7.4 Sensitivity
6.7.5 Where's the sequence?
7 Related Work
7.1 Attention Augmented Methods
7.2 Global Self-Attention
7.3 Efficient Attention
7.4 Weather Forecasts
8 Conclusion & Future Work
8.1 Conclusion
8.2 Future Work
A Appendix
A.1 Best Model Configurations
A.2 Extreme Events
A.2.1 Storm of the Century (1993)
A.2.2 Hurricane Katrina (2005)
A.2.3 Cyclone Emma (2008)

Chapter 1 Introduction
To rain or not to rain, that is the question. We are used to having accurate weather forecasts available in an instant, to quickly decide whether we should take an umbrella with us or not. Not only do they affect our social and personal decisions, but they also have a huge economic impact and, in the case of unforeseen or late detection of extreme weather conditions, even concern our well-being. Current state-of-the-art methods rely on physics models that solve partial differential equations. As accurate as they are, they usually require several hours to compute, even on modern supercomputers.
Recent advancements in deep learning have shown promising results towards fast weather predictions that perform similarly to numerical weather predictions [2, 4]. In contrast to physics models, deep learning models may take a long time to train, but inference can be done in a matter of minutes or even seconds. It therefore makes sense to push this research direction to find models and methods that perform similarly to or better than existing physical models.
Inspiration for new methods might be drawn from image-related tasks, due to the similar locality and pixel coherence in the data. Several deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have shown state-of-the-art performance in image classification [5–8], object detection [5, 6, 9, 10], and image segmentation [11–14].
One recent architecture of particular interest is the Transformer, which has been introduced by Vaswani et al. [15]. The biggest advantage of using Transformers is modelling global input relations and thus capturing long-range interactions on a per-layer basis. This has proven to be the case in Natural Language Processing (NLP), where Transformers have set a new milestone in performance [16–20]. In contrast to convolutional layers, the interaction reach is not limited by a fixed-sized neighbourhood or distance, making attention a natural candidate for image and other higher dimensional tasks. Many papers have successfully used Transformers on various tasks, such as image classification [21–25], object detection [16, 26], and image segmentation [22, 26]. This opens up the question if we can also expect better performance applying Transformers on weather prediction tasks.
We believe that the Transformer's ability to capture and transform pixel relations on a per-layer basis is well suited for weather prediction tasks. This thesis is an extension of the work by Peter Tatkowski [1] on sparse attention for weather prediction, with the goal of improving performance and generally advancing Transformers on weather forecasts. The main contribution of this thesis is maintaining the spatial structure of the data during self-attention for higher-dimensional data. We present and evaluate different model architectures that mix a ResNet with a Transformer. We also present a model that consists of a Transformer-only architecture and hence gets rid of all convolutions. We evaluate candidate models on three extreme weather events of the last 30 years and investigate what the Transformer contributes to the ResNet backbone.
The thesis is structured as follows: We begin by explaining the basics of attention in Chapter 2. We then continue with the introduction of structured attention by exploring axial attention [22, 23] and global self-attention modules [27] in Chapter 3. Before diving into the different model architectures, we first explain which dataset we worked with and which variables are available; most importantly, we state which subset of the data was utilized and which transformations were applied, which is done in Chapter 4. Chapter 5 then goes over the investigated model architectures, and Chapter 6 provides the corresponding results and discussion. Chapter 7 relates our work to recent progress in related fields, and Chapter 8 contains our conclusion and additional thoughts on future work.

Chapter 2 Transformers and Attention
This chapter defines the essential parts of global self-attention inside the Transformer [15] and hopefully gives some intuition for the design choices. First, we need to introduce some terminology for the different entities required and then define Scaled Dot-Product Attention as the basic block of Multi-Head Attention. For the rest of the thesis, we will refer to global attention simply as attention for brevity. Before finally putting together the Transformer, we also go over normalizations, residual connections, and feed-forward layers in Section 2.4.
2.1 What is Attention?
Attention is a concept that is used in modern deep learning models. It is a tool that allows models to learn how different parts of a sequence relate, meaning that different parts can focus or attend to other parts of the data. We often speak of three different tensor entities called Query, Key and Value. Those entities can be obtained by projecting the input data (or input sequence) into different latent spaces, one for each entity. Attention can be modelled as a mapping function from a query and a set of key-value pairs to a real-valued output, which is a weighted sum of the values. To know how much the values contribute to the result, a compatibility function between the query and the keys is computed. We will use the Scaled Dot-Product Attention first introduced in Vaswani et al. [15].
2.2 Scaled Dot-Product Attention
For attention to express how much and where parts of the sequence are related, we need a compatibility function. One such function is the scaled dot-product. Let us denote the dimension of the query and key vectors as d_k and the dimension of the value vectors as d_v. The attention function is computed over multiple queries at once. The queries are packed into a single matrix Q ∈ ℝ^(n×d_k); the same applies for the keys with matrix K ∈ ℝ^(n×d_k) and the values with matrix V ∈ ℝ^(n×d_v), where n is the sequence length. Figure 2.1 shows a diagram of how to compute the scaled dot-product attention. The attention function first computes a dot product between all the queries and the transposed keys, followed by scaling with 1/√d_k and a softmax function. According to Vaswani et al. [15], this scaling is crucial to avoid small gradients when applying the softmax function. At this stage, we can apply a mask to constrain which values each position is allowed to see. Such a mask would only be required for training autoregressive models and is not necessary for our task. After applying the softmax function we get the weights matrix W, which can be thought of as the magnitude
of the contributions towards the result. The attention result is then the multiplication of these contributions with the values matrix V.
Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (2.1)

where softmax(Q K^T / √d_k) is the weights matrix W.
Figure 2.1: Diagram of the scaled dot-product attention. Consisting of a matrix multiplication between the queries Q and the keys K followed by a scaling and optional masking for autoregressive models. The result is run through a softmax layer and then multiplied with the values V . Figure adopted from Vaswani et al. [15]
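A minimal sketch of Equation (2.1) in PyTorch follows; the function name, dimensions and toy inputs are our own illustration, not the thesis code.

```python
# Scaled dot-product attention as in Equation (2.1).
import math
import torch


def scaled_dot_product_attention(q, k, v):
    """q, k: (n, d_k), v: (n, d_v); returns an (n, d_v) tensor."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # compatibility of every query with every key
    weights = torch.softmax(scores, dim=-1)            # the weights matrix W
    return weights @ v                                 # weighted sum of the values


q = k = v = torch.randn(6, 8)                          # a toy sequence of 6 tokens, d_k = d_v = 8
print(scaled_dot_product_attention(q, k, v).shape)     # torch.Size([6, 8])
```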
2.3 Multiple Heads
We usually call one attention block a head. Instead of using one attention head that operates in a single latent space, Vaswani et al. [15] suggest combining H different attention blocks at once. This process is called Multi-Head Attention (MHA). Using MHA they could increase their performance and show that the different projections of the queries, keys, and values into different latent spaces represent different features of the input sequence and attend differently. Thus, in MHA we have H learned linear projections W_i^Q, W_i^K, W_i^V. We can then perform attention over each representation h_i for i ∈ [1, H] in parallel:
h_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    ∀ i ∈ [1, H]    (2.2)
The computed values of every attention head h_1, h_2, ..., h_H are concatenated and run through a final projection W^O that combines all the intermediate attention results, as shown in Equation (2.3). Figure 2.2 visualizes the different operations involved in multi-head attention.
MHA(Q, K, V) = Concat(h_1, h_2, ..., h_H) W^O    (2.3)
Figure 2.2: Learning different representations by using H different scaled dot-product attentions on the linearly projected queries Q, keys K and values V. Figure adopted from Vaswani et al. [15]
2.4 Self-Attention Transformer
Figure 2.3: The Transformer Architecture consisting of stacked encoder layers that rely on the multi-head attention mechanism.
The Transformer architecture introduced in Vaswani et al. [15] consists of two components, the encoder and the decoder. For this thesis, we only use the encoder part of the network, since many encoder-only networks have been shown to perform well [17, 24, 28]. A picture of the encoder-only architecture is provided in Figure 2.3.
Encoder
One encoder block consists of two sublayers, the self-attention module and a feed-forward network, each followed by a residual connection and normalization. One can also do the normalization before the residual connection. To get the full encoder, one simply stacks several encoder blocks on top of each other.
2.4.1 Self-Attention
Self-attention is a simplified version of the general multi-head attention. Instead of the matrices Q, K, and V all being different from each other, we set them all to the same matrix X, which denotes our input to the model. Hence the attention function is:
SelfAttention(X) = MHA(X, X, X) (2.4)
It is crucial to note here that self-attention is permutation invariant without adding a positional encoding and needs further extension to work on spatial input!
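The following small check illustrates this property: without a positional encoding, permuting the input tokens only permutes the self-attention output in the same way, so the layer carries no notion of where a token sits. PyTorch's built-in multi-head attention is used here as a stand-in for the encoder's MHA; dimensions are illustrative.

```python
import torch

torch.manual_seed(0)
mha = torch.nn.MultiheadAttention(embed_dim=16, num_heads=4)
x = torch.randn(10, 1, 16)                      # (sequence, batch, features)
perm = torch.randperm(10)

out, _ = mha(x, x, x)                           # SelfAttention(X) = MHA(X, X, X)
out_perm, _ = mha(x[perm], x[perm], x[perm])    # same tokens, shuffled order

print(torch.allclose(out[perm], out_perm, atol=1e-6))  # True: the output is only permuted
```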
2.4.2 Feed Forward Network
The Transformer architecture [15] depicted in Figure 2.3 requires a Feed Forward network, whose purpose is to process the output of one attention layer in such a way that the next layers can benefit from it. The feed-forward network can be described by two composed linear transformations with an activation function fa in between, i.e. LeakyReLU.
FeedForward(x) = Conv2d(f_a(Conv2d(x)))    (2.5)
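A minimal sketch of such a feed-forward block, assuming 1×1 convolutions and LeakyReLU; the kernel size, hidden width and activation of the actual models may differ.

```python
import torch
from torch import nn


class FeedForward(nn.Module):
    """Two composed convolutions with an activation in between, cf. Equation (2.5)."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden_dim, kernel_size=1),   # first linear map, applied per pixel
            nn.LeakyReLU(),                              # activation f_a
            nn.Conv2d(hidden_dim, dim, kernel_size=1),   # project back to the model dimension
        )

    def forward(self, x):                                # x: (batch, dim, H, W)
        return self.net(x)
```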
2.4.3 Residual Connection & Normalization
The residual connection and normalization are tightly connected to the sublayers and can be deployed in two different modes. Let us denote the sublayer as a function F and the input signal at layer l as x_l. We speak of PreNorm if we apply normalization to the input of the sublayer. On the contrary, if we apply normalization after the residual connection we call it PostNorm. Which placement works best must be found out by experiment; for this thesis we stick to PostNorm using LayerNorm, as introduced in the original Transformer by Vaswani et al. [15].
Figure 2.4: Depiction of the different normalization placements. PreNorm applies the normalization to the input of the submodules, whereas PostNorm normalizes the signal after the residual connection. Figure adopted from Wang et al. [29].

Chapter 3 Structured Attention
One major issue of the attention introduced in Chapter 2 is the loss of spatial relations in the data due to flattening the input. For one-dimensional data this does not matter, since the sequence structure follows naturally. However, treating images or higher-dimensional data as sequences by flattening them into a one-dimensional tensor breaks spatial dependencies. Also, we want to emphasize again that the Transformer is permutation invariant and hence needs some extensions to handle spatial data. In this chapter, we motivate techniques for keeping the spatial relations intact during attention.
3.1 Axial Attention
A natural candidate for attention on high-dimensional data are axial Transformers, as introduced by Ho et al. [23] and later extended in Wang et al. [22]. Instead of applying global self-attention, these models compute attention over a single axis at a time. This has several advantages: First, we do not need to flatten the input tensor. Second, we benefit from lower computational and memory cost than vanilla self-attention, because the length of a single axis is usually much smaller than the total number of elements in the input tensor.
Figure 3.1: Schema of axial attention performed over each dimension separately. Image shows operations for a 2D image, where we compute attention first over the height followed by attention over the width. Figure adopted from Wang et al. [22]
Let us consider an image of N = H × W pixels. Axial attention has a computational and memory cost of O(W·H² + H·W²), because we first compute attention over W sequences of size H. Analogously, computing attention over the width involves H sequences, each of size W. This saves a factor of O(max(H, W)), or O(√N) for a square image, compared to standard self-attention. Generally, for a tensor of dimension d with equally large axes of size S and N = S^d pixels, we save a factor of O(N^((d−1)/d)).
For axial Transformers to work, we simply stack several layers of attention computed over all the available axes to gain a full receptive field. Hence, we inherently rely on the fact that, eventually, every pixel can propagate information to any other pixel, similar to graph attention [30]. In axial attention, every pixel can propagate its information after at most d hops, where d is the number of axes of the input tensor. A sample PyTorch implementation can be found on GitHub (https://github.com/lucidrains/axial-attention).
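A minimal sketch of axial attention over a 2D feature map, built from PyTorch's multi-head attention rather than the linked implementation; head count and dimensions are illustrative.

```python
import torch
from torch import nn


class AxialAttention2d(nn.Module):
    """Attention along the height axis followed by attention along the width axis."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.col_attn = nn.MultiheadAttention(dim, heads)
        self.row_attn = nn.MultiheadAttention(dim, heads)

    def forward(self, x):                                 # x: (batch, dim, H, W)
        b, d, h, w = x.shape
        # height axis: W sequences of length H per batch element
        cols = x.permute(2, 0, 3, 1).reshape(h, b * w, d)
        cols, _ = self.col_attn(cols, cols, cols)
        x = cols.reshape(h, b, w, d).permute(1, 3, 0, 2)
        # width axis: H sequences of length W per batch element
        rows = x.permute(3, 0, 2, 1).reshape(w, b * h, d)
        rows, _ = self.row_attn(rows, rows, rows)
        return rows.reshape(w, b, h, d).permute(1, 3, 2, 0)


x = torch.randn(2, 32, 8, 16)                             # (batch, channels, H, W)
print(AxialAttention2d(32)(x).shape)                      # torch.Size([2, 32, 8, 16])
```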
3.2 Global Self-Attention (GSA) Module
The Global Self-Attention (GSA) module was first introduced in Shen et al. [27]. It is based on the MHA mechanism from Chapter 2 and on axial attention, but introduces a few changes. Every output pixel is computed by combining spatial and content information from every input pixel. Let us denote the spatial dimensions of the data as W and H, as well as the input dimension d_in and output dimension d_out. The total number of pixels then is N = H·W. The input features can be defined as F^i ∈ ℝ^(WH×d_in), with the corresponding output features F^o ∈ ℝ^(WH×d_out). The module shown in Figure 3.2 consists of two parallel attention layers called Content Attention and Positional Attention. As the names suggest, the first layer aims to learn content-based attention maps, whereas the second layer learns features based on spatial positions.
Figure 3.2: Inner workings of a GSA module. First project input sequence into keys, queries and values. In parallel compute content and positional attention. Content attention combines the keys and values and is hence linear in the input sequence. Positional attention works like axial attention on the last two axes and also adds learned relative positional embeddings to columns and rows. Figure adopted from Shen et al. [27]
3.2.1 Content Attention
Like the standard attention in Chapter 2, we first project the input sequence onto keys K, queries Q, and values V by using a convolution with kernel size 1. We can then compute the output feature map F^c by first applying a softmax to each row of the transposed keys and then multiplying the result with the values, resulting in an intermediate tensor of size d_k × d_out. This has the huge benefit of avoiding the quadratic computational complexity when combining this tensor with the queries to obtain the final output feature map F^c. Hence the resources required by this layer are of order O(N).
F^c = Q ρ(K^T) V ∈ ℝ^(WH×d_out)    (3.1)

Shen et al. [27] state that this attention method can be thought of as first gathering the features in the value matrix V into d_k global context vectors using the weights ρ(K^T). These context vectors are then redistributed to the individual pixels using the weights in the query matrix Q. The authors also reported a significant performance drop of 1% in top-1 accuracy on ImageNet when using softmax normalization on the queries. One possible explanation is that normalizing the queries via a softmax constrains the output features to be a convex combination of the global context vectors, which might be too strong a constraint. Hence we will not make use of it in this thesis either.
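A minimal sketch of Equation (3.1) in plain tensor operations; the 1×1 convolution projections are omitted and the dimensions are illustrative.

```python
import torch

N, d_k, d_out = 1024, 16, 32
Q = torch.randn(N, d_k)                   # queries, one row per pixel
K = torch.randn(N, d_k)                   # keys
V = torch.randn(N, d_out)                 # values

context = torch.softmax(K, dim=0).T @ V   # rho(K^T) V: d_k global context vectors of size d_out
F_c = Q @ context                         # redistribute the context to pixels; no N x N matrix is formed
print(F_c.shape)                          # torch.Size([1024, 32])
```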
3.2.2 Positional Attention
This additional layer is needed to incorporate spatial information into the attention, since the content attention layer is equivariant under pixel shuffles. Positional attention is inspired by axial attention and first computes attention over the columns only, followed by an optional batch normalization and a final attention layer over the rows. But instead of only computing the attention, we also apply a learned relative positional embedding in each step. This learned embedding is based on a parameter L, which defines the size of a local neighbourhood around each pixel. Hence, every output pixel is computed by combining contributions of every pixel in a relative neighbourhood of size L × L. Given L, the neighbourhood can be defined by a set of fixed offsets Δ = {−(L−1)/2, ..., 0, ..., (L−1)/2}. We now introduce the learned positional embeddings R^c ∈ ℝ^(L×d_k) and describe the output of the positional attention layer after one column-wise pass, since the row attention part is done analogously. For this we also need to define a matrix V^c_(i,j) ∈ ℝ^(L×d_out) that denotes the L neighbouring pixel values in the same column as pixel (i, j). Additionally, let the query for pixel (i, j) be q_(i,j). Then the output of the positional attention f^c_(i,j) can be written as follows:
f^c_(i,j) = (q_(i,j) R^(cT)) V^c_(i,j)    (3.2)
The module allows us to apply a batch normalization between the column and row attention, which we did not use in our experiments. Let us call the output of the optional batch normalization b^c_(i,j). The row attention part is then defined analogously to Equation (3.2):
f^r_(i,j) = (b^c_(i,j) R^(rT)) V^r_(i,j)    (3.3)
As we have seen in Section 3.1, this layer requires O(N·√N) in terms of memory and computation.

Chapter 4 Data and Baselines
The performance of a deep learning model is tightly coupled to the dataset and training procedure that have been used. This chapter explains where our data comes from and what data collection we use for training. Further, we introduce the WeatherBench benchmark that serves as a comparison foundation for models on weather prediction; it also defines a latitude-weighted root mean squared error metric to score each model. Additionally, we elaborate on the transformations applied to the data that benefit the training process and highlight known problems and open data issues.
4.1 ERA5
The data used to train our models is provided by the European Center for Medium-Range Weather Forecasts (ECMWF). It is based on their fifth generation reanalysis data set ECMWF Reanalysis (ERA5) [31] generated using 4D-Var data assimilation and model forecasts in CY41R2 of ECMWF’s Integrated Forecast System (IFS). The weather data spans from 1979 to 2018 and contains hourly field estimates of over 300 parameters available at 137 pressure levels.
To be aligned with the baselines provided in the WeatherBench, we will only use a subset of the ERA5 data. The WeatherBench repository provides additional information on where and how to download the hosted data. Using this dataset, we make the basic assumption that any meteorological biases in weather patterns over the years 1979-2018 are negligible.
4.2 WeatherBench
WeatherBench is a benchmark data set for data-driven weather forecasting that has been published in [32], with the goal of laying a foundation for new data-driven methods. The data set is based on ERA5 data and offers three resolutions: 1.40625°, 2.8125° and 5.625°. WeatherBench only covers a subset of all the ERA5 variables and pressure levels. A list of all the contained variables of WeatherBench is provided in Table 4.2. The baselines include physical simulations and recent deep learning results run on the same data, shown in Table 4.1.
The data repository provides evaluations on z500 (geopotential at 500 hPa), t850 (temperature at 850 hPa), t2m (temperature at 2m above ground) and pr (precipitation) for 3 and 5 days lead time. For this work we will focus only on predicting z500, t850 and t2m.
First we have to define the root mean squared error (RMSE). Let the output be of size N_lat × N_lon, with N_lat latitude and N_lon longitude grid points, let P ∈ ℝ^(N_forecasts×N_lat×N_lon) be the model's prediction, and let G ∈ ℝ^(N_forecasts×N_lat×N_lon) be the ground truth over N_forecasts forecasts. The RMSE can be defined as:
RMSE(P, G) = (1/N_forecasts) Σ_i √[ (1/(N_lat·N_lon)) Σ_j Σ_k (P_(i,j,k) − G_(i,j,k))² ]    (4.1)
The scores in the benchmark have been computed using a latitude-weighted RMSE (RMSE_lat) metric. The weighting is needed to account for the different sizes of the grid cells: an even weighting would result in too much emphasis on the poles, since the grid points near the poles are much denser. The metric is defined as follows:
RMSE_lat(P, G) = (1/N_forecasts) Σ_i √[ (1/(N_lat·N_lon)) Σ_j Σ_k L(j)·(P_(i,j,k) − G_(i,j,k))² ]    (4.2)

L(j) is the mentioned weighting factor depending on the latitude index j and can be formulated as
L(j) = cos(lat_j) / ( (1/N_lat) Σ_(j') cos(lat_(j')) )    (4.3)
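A minimal sketch of Equations (4.2) and (4.3); the argument shapes mirror the definitions above, while the function itself is our own illustration rather than the WeatherBench evaluation code.

```python
import numpy as np


def lat_weighted_rmse(pred, truth, lats_deg):
    """pred, truth: (N_forecasts, N_lat, N_lon); lats_deg: (N_lat,) latitudes in degrees."""
    w = np.cos(np.deg2rad(lats_deg))
    w = w / w.mean()                                    # L(j), normalized to mean 1
    sq_err = w[None, :, None] * (pred - truth) ** 2     # weight each latitude row
    return np.sqrt(sq_err.mean(axis=(1, 2))).mean()     # spatial mean, sqrt, then mean over forecasts
```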
4.2.1 Baselines

The WeatherBench baselines listed in Table 4.1 all used the same data. They were run at the same resolution of 5.625°, except for the numerical weather models, which were run at coarser resolutions. The baselines can be categorized into four different types of models.
Table 4.1: Table contains NWP models run at different resolutions. Additionally it shows baselines and recent deep learning models scored on the same data contained in WeatherBench using the RMSElat metric on 5.625° resolution.
Model                                   z500 (3/5 d)    t850 (3/5 d)   t2m (3/5 d)
Persistence                             936 / 1033      4.23 / 4.56    3.00 / 3.27
Weekly Climatology                      816             3.50           6.07
IFS T42                                 489 / 743       3.09 / 3.83    3.21 / 3.69
IFS T63                                 268 / 463       1.85 / 2.52    2.04 / 2.44
Operational IFS                         154 / 334       1.36 / 2.03    1.35 / 1.77
UNet Weyn et al. [33]                   373 / 611       1.98 / 2.87    -
ResNet19 Direct (ERA only) [2]          314 / 561       1.79 / 2.82    1.53 / 2.32
ResNet19 Direct (pretrained) [2]        268 / 523       1.65 / 2.52    1.42 / 2.03
ResNet19 Continuous (ERA only) [2]      331 / 545       1.87 / 2.57    1.60 / 2.06
ResNet19 Continuous (pretrained) [2]    284 / 499       1.72 / 2.41    1.48 / 1.92
Persistence is one of the simplest forecasting models, which assumes that the current weather configuration persists over the following days.
Weekly Climatology is obtained by simply averaging all the different weeks over time. The computed averages of the weeks contain the seasonal cycles and outperform the Persistence baseline.
Numerical Weather Prediction IFS (Integrated Forecast System) models are currently the standard for medium-range numerical weather predictions. Such models include the operational IFS model of ECMWF and IFS models run at coarser resolutions. IFS T42 uses a resolution of roughly 2.8° and 62 vertical levels, whereas IFS T63 operates at a resolution of roughly 1.9° and 137 vertical levels. These coarser models were computed to provide a means of comparison in terms of the computational resources a deep learning model might have.
Deep Learning The set of baselines also includes recent deep learning models scored on the WeatherBench dataset, like the U-Net of Weyn et al. [33]. The authors reduce the distortions of the data, which has been projected onto a 2D grid, by mapping the grid onto a cubed sphere. The currently best data-driven model for both direct and continuous predictions, by Rasp and Thuerey [2], uses a pretrained ResNet19.
Table 4.2: Table of variables provided in the WeatherBench dataset. Some variables are available at 13 different pressure levels, others are only available on a single pressure level or are constants. Data and description from WeatherBench [32].
Variable Name              Symbol   Description                                   Unit              Levels
Temperature                t        Temperature                                   [K]               13
Geopotential               z        Proportional to height of a pressure level    [m² s⁻²]          13
Relative humidity          r        Humidity relative to saturation               [%]               13
Specific humidity          q        Mixing ratio of water vapor                   [kg kg⁻¹]         13
Eastward wind              u        -                                             [m s⁻¹]           13
Northward wind             v        -                                             [m s⁻¹]           13
Vorticity (relative)       vo       Relative horizontal vorticity                 [s⁻¹]             13
Potential vorticity        pv       Potential vorticity                           [K m² kg⁻¹ s⁻¹]   13
2m temperature             t2m      Temperature                                   [K]               1
10m u wind component       u10      -                                             [m s⁻¹]           1
10m v wind component       v10      -                                             [m s⁻¹]           1
Total precipitation        tp       Hourly precipitation                          [m]               1
Total cloud cover          tcc      Fractional cloud cover                        [0, 1]            1
Incoming solar radiation   tisr     Accumulated hourly incident solar radiation   [J m⁻²]           1
Orography                  oro      Height of surface                             [m]               1
Land sea mask              lsm      Land-sea binary mask                          [0, 1]            1
Soil type                  slt      Soil-type categories                          -                 1
Latitude                   lat2d    2D field with latitude at every grid point    [°]               1
Longitude                  lon2d    2D field with longitude at every grid point   [°]               1
4.3 Normalization
A first look at a partial histogram excerpt of the data in Figure 4.1 reveals that not all variables follow a normal distribution. Z-normalization rescales the data to zero mean and unit variance, but the resulting distribution is not necessarily normal. Since Rasp and Thuerey [2] achieved good scores using that normalization scheme, we keep it for our experiments as well.
To prevent information leakage into the test and validation data, the normalization statistics were computed using only the training data. Let us denote the input tensor as D ∈ ℝ^(ch×pl×h×w). We provide the following data normalizations:
Z-Normalization (Standardization)
The Z-normalization, often simply referred to as standardization, is a tool that puts the data onto the same scale. By doing so, we standardize the distribution to zero mean and variance 1. Thus the required terms we need to compute are the mean D_μ ∈ ℝ^(ch×pl) and the standard deviation D_σ ∈ ℝ^(ch×pl) per pressure level and channel over all desired time steps. The transformed distribution still has the same shape, but its values and range are scaled, which can be seen in Figure 4.2. The normalized tensor D̂ ∈ ℝ^(ch×pl×h×w) then is:
D̂ = (D − D_μ) / D_σ    (4.4)
Figure 4.2: Histogram of the distribution of t850 after standardization. The distribution still has the same shape but is a rescaled version of the original t850 distribution.
Min-Max Scaling
Min-Max scaling involves finding the minimum D_min ∈ ℝ^(ch×pl) and maximum D_max ∈ ℝ^(ch×pl) per pressure level and channel and obtaining the normalized tensor D̂ ∈ ℝ^(ch×pl×h×w) via Equation (4.5). Min-Max scaling can be thought of as a special case of standardization with mean μ = D_min and σ = D_max − D_min. It has the useful property that the normalized values are always in the range [0, 1] for training. A sample histogram of the distribution after this normalization is shown in Figure 4.3.
D̂ = (D − D_min) / (D_max − D_min)    (4.5)
Figure 4.3: Histogram of the distribution of t850 after Min-Max normalization. Note that the x-axis is rescaled but the distribution shape stays the same.
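A minimal sketch of the two global schemes above, with statistics computed on the training split only; for brevity the tensors here are laid out as (time, channel, h, w), with the pressure levels already folded into the channel dimension.

```python
import torch


def standardize(train, x):
    """Equation (4.4): zero mean and unit variance per channel, statistics from the training data."""
    mu = train.mean(dim=(0, 2, 3), keepdim=True)
    sigma = train.std(dim=(0, 2, 3), keepdim=True)
    return (x - mu) / sigma


def min_max_scale(train, x):
    """Equation (4.5): rescale to [0, 1] using the per-channel training minimum and maximum."""
    lo = train.amin(dim=(0, 2, 3), keepdim=True)
    hi = train.amax(dim=(0, 2, 3), keepdim=True)
    return (x - lo) / (hi - lo)
```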
Local Area Standardization (LAS)
One drawback of the above-mentioned normalization schemes is that they normalize every pixel equally, independent of its location. This can cause very local changes to get lost. However, computing a separate normalization for every pixel does not work well either, because the normalized values might no longer be locally coherent and can become non-smooth. Local Area Standardization, by Grönquist et al. [34], is a mixture of both ideas and computes the mean and standard deviation of every pixel inside a window of fixed size over time. The first step in computing this norm is padding the input: we pad periodically in the longitude direction and pad in the latitude direction by repeating the last value at the border. We then apply two filters of size k × k over each channel, which compute the mean and the standard deviation inside this window. For our experiments we set the filter size to k = 5. Finally, we apply a periodic padding (with zero padding in the latitude direction) to the data and apply a Gaussian filter with distribution N(0, 5) to both outputs. Figure 4.4 provides a visualization of this procedure. The Gaussian blur is not mandatory; it smooths out rapidly changing values but might also tamper with the original signal.
We hope to take some work off the model by providing a normalization that already captures spatially local differences in the per-pixel mean and standard deviation. Figure 4.5 shows the computed standardization statistics (mean and standard deviation) for the variable t850.
Figure 4.4: Visualization of Local Area Standardization by computing moving mean and moving standard deviation over each channel. Finally the intermediate result is padded and we run a Gaussian filter over it. Figure from Grönquist et al. [34]
Figure 4.5: Mean and standard deviations of LAS computed for variable t850 for years 1979-2015 of ERA5.
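A rough sketch of how such per-pixel statistics over a spatial window and the training period can be computed with pooling operations; the padding details are simplified, the final Gaussian blur is omitted, and the exact procedure of Grönquist et al. [34] may differ.

```python
import torch
import torch.nn.functional as F


def local_area_stats(x, k=5):
    """x: (T, 1, H, W) training fields of one channel.
    Returns per-pixel mean and standard deviation over a k x k window and all T time steps."""
    p = k // 2

    def pad(t):
        t = F.pad(t, (p, p, 0, 0), mode="circular")      # wrap around in longitude
        return F.pad(t, (0, 0, p, p), mode="replicate")  # repeat the border value in latitude

    m1 = F.avg_pool2d(pad(x.mean(dim=0, keepdim=True)), k, stride=1)          # E[x]
    m2 = F.avg_pool2d(pad((x ** 2).mean(dim=0, keepdim=True)), k, stride=1)   # E[x^2]
    std = (m2 - m1 ** 2).clamp_min(1e-12).sqrt()
    return m1, std
```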
Figure 4.1: Histogram on partial data including variables z, t, t2m, u, v, q on pressure levels 850 hPa, 500 hPa and 100 hPa. The last row shows variables t2m, tcc and tp that are only available on one pressure level.
4.4 Data Subset
For our experiments in Chapter 6 we only used the subset of the data specified in Table 4.3. To the existing variables, we add a new variable aggr_tp, which stands for the 6-hour aggregated total precipitation, as recommended in Rasp and Thuerey [2].
Table 4.3: List of all the variables contained in the subset that are used for training.
Variable Name              Symbol
Temperature                t
Geopotential               z
Specific humidity          q
Eastward wind              u
Northward wind             v
Potential vorticity        pv
2m temperature             t2m
Total precipitation        tp
Total cloud cover          tcc
Incoming solar radiation   tisr
Orography                  oro
Land sea mask              lsm
Soil type                  slt
Latitude                   lat2d
4.5 Transformations
Log-Transform Precipitation
As we can see in Figure 4.1, the distribution of the total precipitation tp is quite skewed, and taking the 6-hour accumulation does not improve this. Following the suggestion from [2], we also log-transform the distribution of the 6-hour accumulated aggr_tp with ε = 0.001, as follows:
aggr_tp̂ = log(ε + aggr_tp) − log(ε)    (4.6)
Subtracting log(ε) ensures that an initial value of zero remains zero after the transform. The resulting distribution can be seen in Figure 4.6. Note that even after this transformation, the 6-hour accumulated total precipitation is still the variable with the most skewed distribution.
Figure 4.6: Histogram of 6-hour accumulated total precipitation aggr_tp on the left and the log-transformed distribution on the right.
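A minimal sketch of Equation (4.6); the ε value follows the text above, the function name is ours.

```python
import numpy as np

eps = 0.001


def log_transform_precip(aggr_tp):
    return np.log(eps + aggr_tp) - np.log(eps)   # zero precipitation stays exactly zero

print(log_transform_precip(np.array([0.0, 0.001, 0.01])))
```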
Variable Data Shape
We need to address the issue that not all data has the same shape. Many variables, such as t2m, tp or the wind components at 10 m, are only available on one pressure level. Inspired by the paper of Rasp and Thuerey [2], we combine the pressure-level and variable dimensions into one dimension, which we call plvars. For instance, the variable z of shape (time, var, pl, h, w) is transformed into shape (time, plvars, h, w), where time denotes the number of time slices we give the model during training. We picked three time slices (time = 3) at -12 h, -6 h, and 0 h for the model to train on. At this point we have the option to leave the time dimension untouched or to stack it along the plvars dimension as well, as sketched below. For the sake of this thesis we stick to stacking the time slices along the plvars dimension, but we also provide one experiment on future work in Section 6.4. The implications of doing this are discussed in Section 6.7.
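A minimal sketch of this reshaping; the variable and pressure-level counts are illustrative and do not match our data subset exactly.

```python
import torch

time, var, pl, h, w = 3, 6, 11, 32, 64
x = torch.randn(time, var, pl, h, w)

x = x.reshape(time, var * pl, h, w)       # combine variables and pressure levels into plvars
x = x.reshape(1, time * var * pl, h, w)   # additionally stack the three time slices along plvars
print(x.shape)                            # torch.Size([1, 198, 32, 64])
```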
4.6 Known Dataset Inconveniences
Missing Data
Because of how the ERA5 dataset has been generated, the first six time slices contain NaNs in the variables tp and tisr. Additionally, since we compute the 6-hour accumulated precipitation, three more entries cannot be used. This leaves the first nine time steps with NaNs in tp and tisr, which we simply discard before training.
Cold Bias 2000-2006
The Copernicus Climate Change Service (C3S) at ECMWF has published a new version of ERA called ERA5.1 [35], which addresses the issue of a cold bias that was noticed in lower parts of the stratosphere for the years 2000-2006 in ERA5. Using the new dataset, we can expect a more accurate representation of temperature and humidity at lower heights.
ERA5 CDS Data Corruption
Unfortunately, three weeks before the end of this thesis, the ECMWF announced that they found 361 damaged fields out of 3.1 billion available fields in the data¹. This means that, on average, 1 out of every 8.6 million fields is corrupted. The corruption can be seen in Figure 4.7 and is noticeable as a horizontal line whose values are the minimum value of the field. Every user who downloaded the data prior to 2021-04-15 is affected, and this also affects the WeatherBench benchmark.
Figure 4.7: Example image of corrupted data inside ERA5 for some fields. Image taken from [36].
¹ Information published on https://confluence.ecmwf.int/display/CKB/ERA5+CDS%3A+Data+corruption

Chapter 5 Models
This chapter describes the model architectures chosen for our experiments in Chapter 6. We first explain the common network architecture and the custom Periodic Padding layers, since they are the same in all specific model architectures. We then shift our main focus onto the GSA-(Res)Net architecture in Section 5.2, which mixes ResNet blocks and GSA blocks. It is parameterized so that we can obtain a ResNet-only network or a GSA-only network from it, which will be useful during the comparisons between the different models.
5.1 General Network Architecture
The general network architecture is depicted in Figure 5.1. It consists of a periodic padding followed by a 7×7 Conv2d that reduces the number of channels to the desired hidden dimension, e.g. dim_hidden = 128. We then apply a batch norm, activation function f_act and dropout, in this order, before feeding the result to the network at hand. We again apply a periodic padding to the processed network output and reduce it to three output channels.
2D Positional Encoding
This step is inspired by the positional encoding used in the Transformer architecture [15], whose purpose is to break the permutation invariance of the attention operation. Right after reducing the channels to hidden_dim channels, we add a fixed encoding PE to the signal. This should help the model to spatially differentiate pixels (x, y) on different channels and overcome the permutation-invariance property of the Transformer. Let i, j ∈ [0, D/4), where D = hidden_dim. The equations for the positional encoding are provided in Equations (5.1)-(5.4). A sample implementation can be found on GitHub (https://github.com/tatp22/multidim-positional-encoding).
PE(x, y, 2i)                      = sin(x / 10000^(4i/hidden_dim))    (5.1)
PE(x, y, 2i + 1)                  = cos(x / 10000^(4i/hidden_dim))    (5.2)
PE(x, y, 2j + hidden_dim/2)       = sin(y / 10000^(4j/hidden_dim))    (5.3)
PE(x, y, 2j + 1 + hidden_dim/2)   = cos(y / 10000^(4j/hidden_dim))    (5.4)
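A minimal sketch of Equations (5.1)-(5.4); the linked reference implementation may differ in layout, and the (channels, H, W) output shape is our own choice.

```python
import torch


def positional_encoding_2d(hidden_dim, height, width):
    """Fixed 2D sinusoidal encoding: first half of the channels encodes x, second half encodes y."""
    pe = torch.zeros(hidden_dim, height, width)
    d = hidden_dim // 2
    div = 10000 ** (torch.arange(0, d, 2) / d)        # 10000^(4i / hidden_dim)
    x = torch.arange(width).float()                   # longitude index
    y = torch.arange(height).float()                  # latitude index
    pe[0:d:2] = torch.sin(x[None, None, :] / div[:, None, None])
    pe[1:d:2] = torch.cos(x[None, None, :] / div[:, None, None])
    pe[d::2] = torch.sin(y[None, :, None] / div[:, None, None])
    pe[d + 1::2] = torch.cos(y[None, :, None] / div[:, None, None])
    return pe                                         # added to the hidden_dim feature maps


print(positional_encoding_2d(128, 32, 64).shape)      # torch.Size([128, 32, 64])
```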
Embedding
The original Transformer architecture by Vaswani et al. [15] in Section 2.4 requires an input embedding that has not yet been addressed in this thesis. For Transformers on NLP or image tasks, the values can be thought of as tokens and can be embedded in a latent space. Since we have many values of different meanings and scales, e.g. temperature, geopotential or precipitation, it is not obvious to us how to work around this. Also, adding an extra dimension while keeping the channel dimension intact changes the shape of the data to (batch, embedding, ch, H, W). This would force us to either stack the embedding and channel dimensions or use 3D convolutions for the ResNet and channel-reduction parts. We are unsure if this is the way to go and leave this point open. For the sake of this thesis we treat the channel dimension as our embedding dimension, which has the benefit of having fewer parameters and having a similar ResNet architecture and data shape as in [2] for comparisons. In our models we can think of the learned filters of the ResNet as our embedding, or treat the whole encoder stack of the Transformer as one.
Figure 5.1: Depiction of the individual modules involved in our common model architecture. Input is periodically padded and its channels reduced to hidden_dim=128, followed by a fixed positional encoding. The intermediate result is then fed to the network, whose result is again padded and reduced to three output channels.
Activation Functions
Our experiments use one of the following activation functions. Both LeakyReLU [37] and PReLU [38] are standard activation functions commonly used in CNNs. The difference between them is that PReLU involves a scaling factor α_i that can be learned per position i. If α_i = c for all i for some constant c, then PReLU is equal to LeakyReLU.
LeakyReLU(z) = z if z > 0, and αz otherwise    (5.5)

PReLU(z_i) = z_i if z_i > 0, and α_i·z_i otherwise    (5.6)
Periodic Padding
As proposed in [2], we apply a periodic padding in the longitude direction and zero-pad the input in the latitude direction. This should help the model to learn that the data wraps around the globe in the longitude direction. Figure 5.2 shows the padding procedure for pad size 5 on a randomly picked temperature slice.
Figure 5.2: Unpadded input (left) and padded input (right) for pad size 5.
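A minimal sketch of this padding, assuming circular padding along longitude and zeros along latitude; the pad size and tensor layout are illustrative.

```python
import torch
import torch.nn.functional as F


class PeriodicPadding2d(torch.nn.Module):
    def __init__(self, pad):
        super().__init__()
        self.pad = pad

    def forward(self, x):                                           # x: (batch, ch, lat, lon)
        x = F.pad(x, (self.pad, self.pad, 0, 0), mode="circular")   # wrap around in longitude
        return F.pad(x, (0, 0, self.pad, self.pad), value=0.0)      # zeros in latitude


x = torch.randn(1, 1, 32, 64)
print(PeriodicPadding2d(2)(x).shape)   # torch.Size([1, 1, 36, 68])
```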
5.2 GSA-(Res)Net Forecasters
Our network architecture, shown in Figure 5.3, aims at combining two ideas: augmenting an established network with attention, and completely replacing convolutions with attention. To achieve this flexibility we introduce two parameters k and r, which determine the number of ResNet blocks r followed by k GSA blocks. Obviously, if we set k = 0, r = 28, then we get our adaptation of a conventional ResNet28. Note that, as of now, our architecture does not allow GSA blocks in the middle of a ResNet, i.e. a sequence such as ResNet14 - GSA8 - ResNet14. This is a future thought that will be discussed in Section 8.2. For the sake of this thesis we investigate the following configurations:
• k = 0, r = 28: ResNet28

• k = 8, r = 20: ResNet28 with the last 8 blocks replaced by GSA blocks.

• k = 8, r = 0: Network with 8 GSA blocks.

• k = 28, r = 0: Network with 28 GSA blocks.
We also want the model to be able to disregard any contribution of the Transformer, which we achieve by wrapping the Transformer part of the network inside a residual connection. This way we can find out how much and where the Transformer contributes to the prediction compared to the ResNet backbone, if it has one.
Figure 5.3: Detailed view of network part (green block) in Figure 5.1. The model consists of r ResNet blocks followed by k GSA blocks, all wrapped inside a skip connection to eventually cancel out the Transformer altogether.
Hence the network has two distinct network parts, whose blocks need further explaining.
5.2.1 Adapted ResNet Block
Our adapted ResNet block consists of two sequential convolutional sub-blocks. Each applies a periodic padding in the longitude direction before running the input through the convolution. A batch normalization and activation function follow, after which we apply dropout and finally add the result back to the original signal. A visualization of such a block can be found in Figure 5.4.
Figure 5.4: Schema of an adapted ResNet block that simply wraps two sequential convolutional blocks inside a residual connection. Stacking these adapted blocks enough times yields the yellow network part in Figure 5.3.
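A minimal sketch of such a block; channel count, dropout rate and activation are illustrative choices, and the padding is folded into a small helper module rather than a separate layer.

```python
import torch
import torch.nn.functional as F
from torch import nn


class PadConv(nn.Module):
    """3x3 convolution preceded by circular padding in longitude and zero padding in latitude."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3)

    def forward(self, x):
        x = F.pad(x, (1, 1, 0, 0), mode="circular")   # wrap around in longitude
        x = F.pad(x, (0, 0, 1, 1))                    # zeros in latitude
        return self.conv(x)


def conv_stage(channels, dropout=0.0):
    return nn.Sequential(PadConv(channels), nn.BatchNorm2d(channels),
                         nn.LeakyReLU(), nn.Dropout2d(dropout))


class AdaptedResNetBlock(nn.Module):
    """Two (pad, conv, batch norm, activation, dropout) stages inside one residual connection."""
    def __init__(self, channels=128):
        super().__init__()
        self.body = nn.Sequential(conv_stage(channels), conv_stage(channels))

    def forward(self, x):
        return x + self.body(x)
```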
5.2.2 GSA Block
The GSA block contains the basic GSA module presented in Section 3.2, but we also enhance the standard GSA module by adding a Latitude Weighting and an Affine Weighting for more flexibility and to help guide the model's attention. Both custom layers are optional and are denoted in round brackets in Figure 5.5.
Figure 5.5: Picture of the GSA block, which consists of the GSA module introduced in Section 3.2 enhanced with a latitude weighting layer and an affine layer.
Latitude Weighting W_lat
Distortions due to the projection of the data onto the 2D grid cannot be avoided. Although we can train our networks with a loss function that weights the errors accordingly, i.e. L1_lat or MSE_lat, the attention mechanism might not know that it should attend less to pixels near the poles. Our idea is to weight the computed attention by the latitude areas, similar to what has been done in Equation (4.2).
Let F^attn ∈ ℝ^(ch×N_lat×N_lon) denote the sum of the content attention and positional attention outputs. We can then multiply each channel by the latitude weights from Equation (4.3). Before multiplying by the latitude weights we need to reshape the data such that the latitude data points form the last dimension; afterwards we permute the tensor back to its original shape. For simplicity we only provide the formula on the already permuted tensor F̂^attn ∈ ℝ^(ch×N_lon×N_lat).
F̂^lat_(c,i,j) = F̂^attn_(c,i,j) · L(j)    ∀ j    (5.7)
Affine Layer W_affine
This layer can be thought of as a generalization of the latitude weighting. Instead of constraining the model to a fixed set of weights, we let the model learn a set of affine weights for each channel and pixel. Let F^o ∈ ℝ^(ch×N_lat×N_lon) be the result of a GSA module, and let the affine layer W_affine be described by a weights matrix A ∈ ℝ^(ch×N_lat×N_lon) and a bias matrix B ∈ ℝ^(ch×N_lat×N_lon). The output can then be defined as follows:
AffineLayer(F^o) = A ⊙ F^o + B    (5.8)

where ⊙ denotes element-wise multiplication.

Chapter 6 Experiments on WeatherBench
The main focus of this chapter is the results of our experiments in Section 6.4. We compare our models to existing baselines and also investigate the performance on three handpicked weather events of the last 30 years, e.g. hurricane Katrina in 2005. We provide visualisations of the learned embeddings and attention heads for each event to determine what the model is focusing on. We are also interested in the stability of the models, i.e. how they behave under slight changes to the input, which is addressed in Section 6.6. Before diving into the numbers we want to explain as precisely as possible how the experiments are set up in Section 6.1 and also define how our models predict the state of the weather into the future in Section 6.2. It is also essential that we shed light on how the models were trained, which is done in Section 6.3. Finally, we discuss the obtained results in Section 6.7.
6.1 Setup
Hardware
Due to the large number of experiments and long training times, our experiments were carried out on two different nodes provided by CSCS: node ault06 with 4× NVIDIA A100 SXM4 40 GB GPUs, 128 AMD EPYC 7742 64-core processor cores and 512 GB of shared memory, and node ault25 with 4× NVIDIA V100 32 GB GPUs, 72 AMD EPYC 7742 64-core processor cores and 726 GB of shared memory.
Data Loading
We use PyTorch’s [39] DataLoader class for facilitating efficient and simple data loading. The original xarray dataset is 304 GB and does not fit into memory on most machines. We have spent a lot of time in the first third of this thesis to find the fastest strategy to load the data. The options included having one NumPy file for each day (300k+ files), loading from xarray directly or having single big NumPy file for train, validation and test data. Unfortunately, loading the data directly from xarray [40] as well as splitting the dataset into many NumPy files was too slow, due to the overhead of the file loads. It worked best to convert the different data subsets (train, validation, test) to one big, memory-mapped file, loaded to shared memory before training. The converted dataset amounts to roughly 132 GB and nicely fits into memory on all nodes. The size reduction comes from using only a subset of the data (Section 4.4)andstoringitinNumPy data format. Since we will enable data shuffling inside the DataLoader during training, the accessed indices are random. With NumPy’s option to load a file in memory-mapped mode, random accesses remain
fast, since the whole dataset resides in memory. This limits us to using PyTorch's DataParallel to distribute the load evenly among the available GPUs, as DistributedDataParallel would load the whole dataset once per process and overflow the memory.
Runs, Logs and Checkpoints
All the runs and logs, including plots during training, were managed with Weights & Biases (W&B) [41]. To make sure we have a copy of the checkpoints even in the worst case, we not only keep a local copy of the best checkpoints, but also upload the training script and model code, including the best and latest checkpoint, to W&B. The best and latest checkpoints are updated every epoch and every 800 iterations. The best weights are also available through our repository¹.
Since we want to have the freedom to try out different normalizations and not generate a new data set per normalization option, we normalize the data before feeding it into the model. This flexibility incurs a small performance loss due to the computational overhead of the normalization.
6.2 Predictions
Prediction Type

There are three prediction types:
1. Direct: The model is trained and evaluated to always predict the state at exactly t + Δt, where Δt represents the lead time, e.g. Δt = 3 days.

2. Continuous: The model is trained to predict the state at t + Δt, but the lead time Δt is given as an input parameter. Ideally, the model learns to do predictions for lead times in the range [0, Δt_max] for a randomly chosen Δt. During evaluation, we can then fix Δt to a desired value in the range the model has been trained on (e.g. 3 days).

3. Iterative: The models are trained to predict the state for smaller time steps, e.g. Δt = 1 hour into the future. During evaluation, one repeatedly applies the model to the intermediate states to reach the desired forecast time, as in a standard autoregressive setting.
Which prediction mode to use depends on the task at hand and which properties we want to exploit. One obvious drawback of the direct approach compared to continuous and iterative models is that one model per desired lead time must be trained. However, direct models can learn more specific characteristics for the given lead time. Non-direct models have the advantage that arbitrary forecast times can be chosen, but certain limitations exist. Due to the feedback-like nature of iterative models, like the one in Weyn et al. [33], errors can propagate very rapidly and may lead to huge errors over time. Similarly, if one predicts beyond the training forecast range of a continuous model, we might get large errors. It is therefore a trade-off between specificity, generalization and numerical stability. For the sake of this thesis we constrain ourselves to continuous forecasts to remain flexible concerning lead time, stability and evaluation time.
¹ https://spclgitlab.ethz.ch/deep-weather/structured-weather-transformer
6.3 Training
The models were trained with PyTorch [39] version 1.7.1.
Loss Functions
We complement the loss functions already available in PyTorch with the functions below. Let N_lat denote the number of latitude points on the grid and let w_lat ∈ ℝ^(N_lat) denote the latitude weights. We then define the model prediction and ground truth to be P ∈ ℝ^(N_forecasts×C×N_lat×N_lon) and G ∈ ℝ^(N_forecasts×C×N_lat×N_lon) over N_forecasts forecasts and C output channels. The following loss functions give the resulting loss on a per-channel basis c. If we want a scalar loss, we simply aggregate over the channels by taking the mean.
L1_lat(P, G, c) = (1/N_forecasts) Σ_k [ (1/(N_lat·N_lon)) Σ_i Σ_j |P_(k,c,i,j) − G_(k,c,i,j)| · w_lat,i ]    (6.1)
MSE_lat(P, G, c) = (1/N_forecasts) Σ_k [ (1/(N_lat·N_lon)) Σ_i Σ_j (P_(k,c,i,j) − G_(k,c,i,j))² · w_lat,i ]    (6.2)
Optimizer
During all experiments we use the Adam optimizer with betas = (0.9, 0.98) and a weight decay factor of 1 × 10⁻⁵. The same amount of weight decay is applied to every layer of the network.
Mixed Precision
We were unsuccessful in using the native mixed precision offered by PyTorch to increase our performance further. The issue could be tracked down to the last convolutional layer inside the GSA module, which maps the intermediate dimensions to the output dimensions. We are not completely sure what exactly happens, but the computation in float16 eventually produces NaNs. This issue was not investigated further due to time constraints.
Batches, Shuffle, Split
We use the same data split proposed in Rasp et al. [42], which separates the data into the following three non-overlapping sets:
• Train Set: Covers the years [1979,2016).
• Validation Set: Covers the year [2016, 2017).

• Test Set: Covers the years [2017, 2019).
We also perform random shuffling of the data before handing it to the model. This is only done during training. Upon evaluation, the order of the data is preserved, such that the visualizations and plots are always computed on the same days for the different models.
Initialization
We initialize the weights of Conv2d layers with Xavier uniform [43] initialization and set the bias to a constant 0. Here d_in and d_out correspond to the number of input and output dimensions (or nodes) of the layer. The weights of a Conv2d that maps an input tensor from d_in channels to d_out channels are then sampled from the distribution given in Equation (6.3).
We also initialize the scale of BatchNorm and GroupNorm to 1 and the bias (shift) to 0. Every other layer in the network uses the default initialization provided by PyTorch.
$$W \sim \mathcal{U}\left[ -\frac{\sqrt{6}}{\sqrt{d_{in} + d_{out}}},\ \frac{\sqrt{6}}{\sqrt{d_{in} + d_{out}}} \right] \qquad (6.3)$$
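A sketch of this initialization scheme as it could be applied via `model.apply(...)` in PyTorch is shown below; it mirrors the rules stated above, not necessarily the exact code used in our experiments.

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """Apply with model.apply(init_weights)."""
    if isinstance(module, nn.Conv2d):
        nn.init.xavier_uniform_(module.weight)     # Equation (6.3)
        if module.bias is not None:
            nn.init.constant_(module.bias, 0.0)
    elif isinstance(module, (nn.BatchNorm2d, nn.GroupNorm)):
        nn.init.constant_(module.weight, 1.0)      # scale
        nn.init.constant_(module.bias, 0.0)        # shift
```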
Learning Rate Schedule
We kept our learning rate schedule similar to the one proposed in Rasp and Thuerey [2]. We use a ReduceLROnPlateau scheduler with factor = 0.5 and patience = 0, meaning that we halve the learning rate after every epoch in which the validation or test loss did not improve. The initial learning rate was chosen to be $1 \times 10^{-4}$ because it yielded good scores; learning rates smaller than $1 \times 10^{-5}$ resulted in very slow convergence of the models. We believe the learning rate could be tweaked further.
Early Stopping
We consider the model converged if the minimum of the validation or test loss does not improve for three consecutive epochs. With the learning rate schedule above, this is equivalent to three learning rate decreases in a row.
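The scheduler and the early stopping criterion together might look like the following sketch; `train_one_epoch`, `evaluate` and `max_epochs` are placeholders for the corresponding pieces of our training loop.

```python
from torch.optim.lr_scheduler import ReduceLROnPlateau

scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=0)

max_epochs = 100                                      # upper bound, rarely reached
best_val, epochs_without_improvement = float("inf"), 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    val_loss = evaluate(model, val_loader)            # hypothetical helper

    scheduler.step(val_loss)   # halves the LR whenever the loss did not improve

    if val_loss < best_val:
        best_val, epochs_without_improvement = val_loss, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= 3:            # early stopping criterion
            break
```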
Remarks (Dropout, Training Time)
Not applying dropout led to better validation scores, so we do not use it. On the described nodes we achieve a throughput of roughly one epoch per hour. The number of epochs required for a model to converge varies greatly, depending on the model's size. Our larger models usually require 30 epochs, or a minimum of 30 h per experiment.
6.4 Results
Table 6.2 shows the results on the test set for the variables z500, t850 and t2m for 3 and 5 days lead time, for several parameter variations. For reasons of space, we had to abbreviate some of the parameters; the abbreviations and their descriptions are listed in Table 6.1. Our best model compared to the WeatherBench baselines is presented in Table 6.3.
We also conducted some experiments on Lambda Networks [44], Attention Augmented Convolutions [21] and DenseNets [45] in an earlier phase of this thesis, but we did not dive deeper into those methods due to slow training and poor initial results. At that time the models only predicted t850 for 3 days lead time and got stuck at an RMSElat test score of 2.10. We could increase the training speed by using a linear attention method that combines the values and keys first (sketched below), but we observed no decrease in the test loss.
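One common formulation of such a linear attention, where the key-value contraction is computed before involving the queries, is sketched below; the exact variant and normalization we experimented with may differ.

```python
import torch
import torch.nn.functional as F

def linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Attention that contracts keys with values first, O(N * d^2) instead of O(N^2 * d).
    q, k, v: (batch, heads, N, d)."""
    q = F.softmax(q, dim=-1)                            # softmax over the feature dimension
    k = F.softmax(k, dim=-2)                            # softmax over the sequence dimension
    context = torch.einsum("bhnd,bhne->bhde", k, v)     # (batch, heads, d, d)
    return torch.einsum("bhnd,bhde->bhne", q, context)  # (batch, heads, N, d)
```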
Remark: Due to unforeseen incidents on the AULT cluster during the last weeks of this thesis, we were not able to run all models until full convergence. We would have required roughly 1.5 additional weeks to complete all our experiments.
Table 6.1: Abbreviations for parameters listed in Table 6.2
Abbreviation   Meaning
L              Size of the relative positional embedding in GSA modules
bs             Batch size
w              Use latitude weighting inside the GSA module
a              Use affine layer after the GSA module
k              Number of GSA modules
r              Number of ResNet layers used
Eps.           Number of epochs the model was trained
Params         Number of parameters of the model
Table 6.2: Experiments on WeatherBench using a simple 5-layer CNN5 [42] and GSA-Net, GSA-ResNet and ResNet model types for 3 and 5 days lead time, predicting z500, t850 and t2m.
Group         #   L    dim  bs  norm      activation  w  a  loss     k   r   z500 (3/5 d)      t850 (3/5 d)  t2m (3/5 d)  Eps.  Params
CNN5 [42]     11  -    128  64  standard  LeakyReLU   -  -  MSElat   -   -   634.59 / 757.44   2.87 / 3.31   2.79 / 3.03  29    906K
GSA-Nets      05  3    128  32  standard  LeakyReLU   ✗  ✗  MSElat   8   0   349.84 / 627.03   1.90 / 2.81   1.59 / 2.17  25    11.4M
              06  64   128  32  standard  LeakyReLU   ✗  ✗  L1lat    8   0   324.48 / 573.77   1.83 / 2.69   1.53 / 2.08  27    11.5M
              07  3    256  32  standard  LeakyReLU   ✓  ✗  MSElat   8   0   346.06 / 599.11   1.89 / 2.73   1.60 / 2.14  20    41.7M
              08  3    256  32  standard  LeakyReLU   ✓  ✗  L1lat    8   0   326.32 / 586.05   1.83 / 2.69   1.54 / 2.13  14    41.7M
GSA-ResNets   01  3    128  32  standard  LeakyReLU   ✗  ✗  MSElat   8   20  329.27 / 553.54   1.85 / 2.60   1.59 / 2.05  20    17.3M
              02  64   128  32  standard  LeakyReLU   ✗  ✗  L1lat    8   20  310.80 / 533.92   1.79 / 2.55   1.52 / 2.04  20    17.4M
              03  3    256  32  standard  LeakyReLU   ✓  ✗  MSElat   8   20  352.18 / 566.46   1.95 / 2.66   1.67 / 2.28  11    65.3M
              04  3    128  32  standard  LeakyReLU   ✓  ✗  L1lat    8   20  321.09 / 555.92   1.83 / 2.62   1.55 / 2.06  14    17.3M
              12  3    128  32  standard  LeakyReLU   ✓  ✗  MSElat   8   20  339.38 / 583.84   1.86 / 2.70   1.59 / 2.14  14    17.3M
              14  3    128  32  standard  PReLU       ✓  ✗  L1lat    8   20  359.24 / 578.92   1.95 / 2.70   1.71 / 2.28  7     17.3M
              15  3    128  32  LAS       LeakyReLU   ✓  ✗  MSElat   28  0   317.60 / 542.47   1.80 / 2.56   1.55 / 2.14  11    37.6M
              16  3    128  32  standard  LeakyReLU   ✗  ✓  MSElat   8   20  310.22 / 550.17   1.77 / 2.58   1.49 / 2.02  55    21.5M
ResNets       09  -    128  64  standard  LeakyReLU   -  -  MSElat   0   28  322.91 / 555.19   1.83 / 2.61   1.55 / 2.06  29    9.18M
              10  -    256  64  standard  LeakyReLU   -  -  L1lat    0   28  305.40 / 542.32   1.79 / 2.61   1.50 / 2.03  19    34.9M
Future        17  3    128  32  standard  LeakyReLU   ✗  ✗  L1lat    8   20  315.38 / 558.53   1.79 / 2.62   1.50 / 2.05  33    16.7M
Table 6.3: Baselines in WeatherBench and our best model, scored on the same data contained in WeatherBench using the RMSElat metric. Our best model is marked with "(ours)".
Model                                   z500 (3/5 d)      t850 (3/5 d)  t2m (3/5 d)
Persistence                             936 / 1033        4.23 / 4.56   3.00 / 3.27
Weekly Climatology                      816               3.50          6.07
IFS T42                                 489 / 743         3.09 / 3.83   3.21 / 3.69
IFS T63                                 268 / 463         1.85 / 2.52   2.04 / 2.44
Operational IFS                         154 / 334         1.36 / 2.03   1.35 / 1.77
Linformer Weather [1]                   505 / 724         2.44 / 3.16   -
UNet Weyn et al. [33]                   373 / 611         1.98 / 2.87   -
ResNet19 Continuous (ERA only) [2]      331 / 545         1.87 / 2.57   1.60 / 2.06
ResNet19 Direct (ERA only) [2]          314 / 561         1.79 / 2.82   1.53 / 2.32
GSA-ResNet #16 (ERA only) (ours)        310.22 / 550.17   1.77 / 2.58   1.49 / 2.02
ResNet19 Continuous (pretrained) [2]    284 / 499         1.72 / 2.41   1.48 / 1.92
ResNet19 Direct (pretrained) [2]        268 / 523         1.65 / 2.52   1.42 / 2.03
To keep the number of plots small and manageable in the following sections, we restrict ourselves to the candidate models shown in Table 6.4 for further investigation.
Table 6.4: Candidate models picked from Table 6.2 for further research and evaluations.
#   Model       L   dim  bs  norm      act        w  a  loss    k  r   z500 (3/5 d)      t850 (3/5 d)  t2m (3/5 d)
10  ResNet      -   128  64  standard  LeakyReLU  -  -  MSElat  0  28  305.40 / 542.32   1.79 / 2.61   1.50 / 2.03
16  GSA-ResNet  3   128  32  standard  LeakyReLU  ✗  ✓  MSElat  8  20  310.22 / 550.17   1.77 / 2.58   1.49 / 2.02
06  GSA-Net     64  128  32  standard  LeakyReLU  ✗  ✗  L1lat   8  0   324.48 / 573.77   1.83 / 2.69   1.53 / 2.08
6.4.1 Relative Positional Embeddings
This section deals with the learned relative positional embeddings of the GSA models. The following subsections provide plots of the column matrix $R_c$ and the row matrix $R_r$ for every layer. We can observe that the learned relative positional embeddings in Figures 6.1 and 6.2 are definitely not random. In both figures, almost all the embedding layers are very sparse and give weight to specific relative locations, best visible in Figure 6.2. It also has to be noted that the magnitudes of the values in the embeddings in Figure 6.1 are almost zero and most likely do not contribute to the attention.
GSA-ResNet #16
Figure 6.1: Learned relative positional embedding of size L = 3 for the model GSA-ResNet #16.
GSA-Net #06
Figure 6.2: Learned relative positional embedding of size L = 64 for the model GSA-Net #06.
6.4.2 First Indicators on Future Ideas
The list of experiments in Table 6.2 also contains one small experiment on future work. The results of experiment #17 should act as a first indicator of how well novel ideas perform. We were specifically interested in keeping the time dimension in the data grid instead of stacking it along the channel dimension, as in [2]. Hence, the model architecture differs slightly from the one presented in Chapter 5. Since these experiments are not the main focus of this thesis, we merely provide the scores and do not investigate them further. We will elaborate more on future work in Section 8.2, but we briefly go over the main differences in architecture here.
By not stacking the time steps along the channel dimension, and thus not collapsing the time dimension into the channels, the input data is of size (batch, time, vars, h, w), where vars stands for the different parameters over different pressure levels, e.g. t850. First, instead of adding a 2D fixed positional encoding to the input, we add a 3D fixed sinusoidal positional encoding computed over the time dimension. For the model to further process the input, we reshape the data to (batch, vars, h · time, w). By doing so, the 2D grid consists of consecutive blocks of rows per day, which can be interpreted as a sequence per parameter channel. Figure 6.3 provides a visualization, and a code sketch of the reshaping is given after the figure. Before applying the last layer that transforms the intermediate result to the desired number of output channels, we reshape the data back to (batch, time, vars, h, w) and remove the time dimension by taking the mean over it with torch.mean(tensor, dim=1).
Figure 6.3: Example visualization of the data after reshaping it to (batch, vars, h · time, w).
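A minimal sketch of the reshaping described above is shown below; the shapes are purely illustrative, and the 3D positional encoding added before this step is omitted.

```python
import torch

# Illustrative shapes only: batch 8, 2 time steps, 7 variables, a 32x64 grid.
x = torch.randn(8, 2, 7, 32, 64)                       # (batch, time, vars, h, w)
b, t, v, h, w = x.shape

# Stack the time steps as consecutive blocks of rows per parameter channel.
x = x.permute(0, 2, 1, 3, 4).reshape(b, v, t * h, w)   # (batch, vars, h*time, w)

# ... the model processes this enlarged 2D grid ...

# Undo the reshape and average out the time dimension before the output layer.
x = x.reshape(b, v, t, h, w).permute(0, 2, 1, 3, 4)    # (batch, time, vars, h, w)
x = torch.mean(x, dim=1)                               # (batch, vars, h, w)
```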
6.5 Extreme Weather Events
To compare the models' scores, we evaluate them using the same metric that averages the prediction performance over all forecasts. Thus, having more accurate forecasts for individual dates only mildly influences the average score. But exactly these individual dates might be of extreme importance, as they contain forecasts under extreme weather conditions. We want to elaborate on that thought and investigate whether Transformers, which only perform slightly better in the mean, perform convincingly better under extreme weather conditions.
We evaluate the candidate models on the Storm of the Century (1993), Hurricane Katrina (2005) and Cyclone Emma (2008). Each subsection provides the error for 1, 3 and 5 days lead time for the variables z500 and t2m. Additionally, we provide saliency plots, the top 20 gradients, attention heads and the Transformer's contribution to the prediction for 3 days into the future. All remaining plots can be found in Appendix A.2.
Before diving into the results for the different dates, we want to briefly explain how the plots mentioned above were generated. Most of the plots are based on the gradients computed with respect to the inputs. More specifically, let the gradients for our stacked-channel models be $W_{grad} \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ is the batch size, $C$ is the number of stacked channels and $H \times W$ is the size of the 2D values grid. Additionally, let the prediction be $\mathcal{F}(X; \Theta)$ and the input tensor be $X$. Since we only evaluate the model on one specific date, our batch size is $B = 1$. One can then obtain the gradients by setting requires_grad_() on the input tensor in PyTorch and running one backward pass after computing the loss function.
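A minimal sketch of this procedure is given below; `input_tensor`, `model`, `loss_fn` and `target` are placeholders, and the aggregation into a saliency map shown in the last line (maximum absolute gradient over channels) is one common choice, not necessarily the exact one used for our plots.

```python
import torch

x = input_tensor.clone().requires_grad_(True)   # input X, shape (1, C, H, W)
pred = model(x)                                 # forward pass F(X; Theta)
loss = loss_fn(pred, target)                    # e.g. the latitude-weighted MSE from Section 6.3
loss.backward()                                 # one backward pass populates x.grad

w_grad = x.grad                                 # gradients W_grad of shape (B, C, H, W), here B = 1
saliency = w_grad.abs().amax(dim=1)             # one possible per-pixel saliency aggregation
```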