Research Collection
Master Thesis
Structured Attention Transformers on Weather Prediction
Author(s): Ernst, Lukas
Publication Date: 2021
Permanent Link: https://doi.org/10.3929/ethz-b-000483966
Rights / License: In Copyright - Non-Commercial Use Permitted
Structured Attention Transformers on Weather Prediction
Master Thesis
Lukas Ernst
Scalable Parallel Computing Laboratory ETH Zürich
Advisors: Nikoli Joseph Dryden, Tal Ben-Nun
Supervisors: Prof. Dr. Torsten Hoefler
May 5, 2021

Abstract
Having accurate weather forecasts is of great importance. Not only do they influence our daily decision making, but they also affect our social lives, well-being and even our economy. Take hurricanes as an example: they are responsible for damage amounting to billions of dollars and can even threaten people's lives. Current state-of-the-art methods rely on physics simulations that solve complex partial differential equations. However, solving such equations is time-consuming and requires several hours, even on modern supercomputers.
Deep learning models take some time to train but can produce predictions in a matter of minutes. Recent advancements in weather forecasting using deep learning have shown large improvements in performance, reaching parity with NWP models run at the same resolution. This thesis focuses on one such advancement, the Transformer architecture, which originally raised the bar in Natural Language Processing. This architecture has the drawback that it must flatten its input, which breaks the spatial relations in the data. Also, Transformers are permutation invariant, which is not desirable for spatial tasks such as weather prediction.
This thesis presents a novel model architecture consisting of an optional ResNet backbone followed by a Transformer based on axial attention and efficient global self-attention. Additionally, we introduce a Transformer-only model that completely replaces convolutions with attention blocks. Our main contribution is to preserve the spatial structure of the data during self-attention. By adding two different optional weighting layers after the attention block, we try to help the model guide its attention and reduce the emphasis on the earth’s poles. All the investigated models predict geopotential at 500hPa, temperature at 850hPa and the temperature two meters above ground all at once in a continuous setting. We evaluate our models on ERA5 data and pick three candidates to compare on extreme weather events.
We show that adding a Transformer on top of a ResNet backbone only slightly increases performance in the mean, and that the Transformer-only models perform worse than or similarly to other ResNets. Our best model improves on the prior work by Peter Tatkowski [1] by at least 18% on all output variables. It also marginally outperforms the direct ResNet19 (unpretrained, ERA5-only) by Rasp and Thuerey [2] for all lead times and beats their continuous ResNet19 (unpretrained, ERA5-only) for 3 days lead time. We find that the Transformer-only models are very close to our best model in terms of performance. Further investigations of the attention heads, attention maps and saliency maps provide deeper insight into the Transformer's contributions to the ResNet and reveal a peculiar focus on Antarctica and on regions of major global ocean currents.
Acknowledgements
Errors like RuntimeError: shape '[-1, 400]' is invalid for input of size 384, or questions like “Do our predictions even make sense?” and “Are the normalizations for the data broken?”, were constant companions during the development of this thesis. I often looked at various plots and thought to myself “What is the model even doing?”. This state of mind is perfectly visualized in the xkcd comic in Figure 1. But exactly these are the exciting parts: pushing ahead into more or less unknown territory on a subject that concerns everyone.
I’m very grateful to have been granted the opportunity to write my master thesis in the SPCL group under Professor Torsten Hoefler. I also want to thank my two advisors, Nikoli Dryden and Tal Ben-Nun, who always helped me out, discussed open questions with me and nudged me in the right direction. They also provided me with the initial set of papers to get me up to speed in recent advancements and related topics.
The appearance of COVID-19 has not made things any easier. Staying at home and the limited variety in daily life were tough and did not help in drawing extra motivation at times. My gratitude goes to my friends, family, and partner for always having an open ear, for being absolute motivators in tough times, and for their patience with my occasional technical talks.
Figure 1: Figure from xkcd - A webcomic of romance, sarcasm, math, and language. [3].
Contents
Abstract
Acknowledgements
1 Introduction
2 Transformers and Attention
2.1 What is Attention?
2.2 Scaled Dot-Product Attention
2.3 Multiple Heads
2.4 Self-Attention Transformer
2.4.1 Self-Attention
2.4.2 Feed Forward Network
2.4.3 Residual Connection & Normalization
3 Structured Attention
3.1 Axial Attention
3.2 Global Self-Attention (GSA) Module
3.2.1 Content Attention
3.2.2 Positional Attention
4 Data and Baselines
4.1 ERA5
4.2 WeatherBench
4.2.1 Baselines
4.3 Normalization
4.4 Data Subset
4.5 Transformations
4.6 Known Dataset Inconveniences
5 Models
5.1 General Network Architecture
5.2 GSA-(Res)Net Forecasters
5.2.1 Adapted ResNet Block
5.2.2 GSA Block
6 Experiments on WeatherBench
6.1 Setup
6.2 Predictions
6.3 Training
6.4 Results
6.4.1 Relative Positional Embeddings
6.4.2 First Indicators on Future Ideas
6.5 Extreme Weather Events
6.5.1 Storm of the Century (1993)
6.5.2 Hurricane Katrina (2005)
6.5.3 Cyclone Emma (2008)
6.6 Sensitivity Analysis
6.6.1 Variable Lead Time
6.6.2 Behaviour under Perturbations
6.7 Discussion
6.7.1 Performance
6.7.2 Lacking Data?
6.7.3 Attention Heads & Affine Layers
6.7.4 Sensitivity
6.7.5 Where's the sequence?
7 Related Work
7.1 Attention Augmented Methods
7.2 Global Self-Attention
7.3 Efficient Attention
7.4 Weather Forecasts
8 Conclusion & Future Work
8.1 Conclusion
8.2 Future Work
A Appendix
A.1 Best Model Configurations
A.2 Extreme Events
A.2.1 Storm of the Century (1993)
A.2.2 Hurricane Katrina (2005)
A.2.3 Cyclone Emma (2008)

Chapter 1 Introduction
To rain or not to rain, that is the question. We are used to having accurate weather forecasts available in an instant, to quickly decide whether we should take an umbrella with us or not. Not only do they affect our social and personal decisions, but they also have a huge economic impact and, in the case of unforeseen or late detection of extreme weather conditions, even concern our well-being. Current state-of-the-art methods rely on physics models that solve partial differential equations. As accurate as they are, they usually require several hours to compute, even on modern supercomputers.
Recent advancements in deep learning have shown promising results towards fast weather predictions that perform similarly to numerical weather predictions [2, 4]. In contrast to physics models, deep learning models may take a long time to train, but inference can be done in a matter of minutes or even seconds. It therefore makes sense to push this research direction to find models and methods that perform similarly to or better than existing physical models.
Inspiration for new methods might be drawn from image-related tasks, due to the similar locality and pixel coherence in the data. Several deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have shown state-of-the-art performance in image classification [5–8], object detection [5, 6, 9, 10], and image segmentation [11–14].
One recent architecture of particular interest is the Transformer, which has been introduced by Vaswani et al. [15]. The biggest advantage of using Transformers is modelling global input relations and thus capturing long-range interactions on a per-layer basis. This has proven to be the case in Natural Language Processing (NLP), where Transformers have set a new milestone in performance [16–20]. In contrast to convolutional layers, the interaction reach is not limited by a fixed-sized neighbourhood or distance, making attention a natural candidate for image and other higher dimensional tasks. Many papers have successfully used Transformers on various tasks, such as image classification [21–25], object detection [16, 26], and image segmentation [22, 26]. This opens up the question if we can also expect better performance applying Transformers on weather prediction tasks.
We believe that the Transformer's ability to capture and transform pixel relations on a per-layer basis is well suited for weather prediction tasks. This thesis is an extension of the work by Peter Tatkowski [1] on sparse attention for weather prediction, with the goal of improving performance and generally advancing Transformers on weather forecasts. The main contribution of this thesis is maintaining the spatial structure of the data during self-attention for higher-dimensional data. We present and evaluate different model architectures that mix a ResNet with a Transformer. We also present a model that consists of a Transformer-only architecture and hence gets rid of all convolutions. We evaluate candidate models on three extreme weather events of the last 30 years and investigate what the Transformer contributes to the ResNet backbone.
The thesis is structured as follows: We begin by explaining the basics of attention in Chapter 2. We then continue with the introduction of structured attention by exploring axial attention [22, 23] and global self-attention modules [27] in Chapter 3. Before diving into the different model architectures, we first explain which dataset we worked with and which variables are available; most importantly, we state which subset of the data was utilized and which transformations were applied, which is done in Chapter 4. Chapter 5 then goes over the investigated model architectures, and Chapter 6 provides the corresponding results and discussion. Chapter 7 relates our work to recent progress in related fields, and Chapter 8 contains our conclusion and additional thoughts on future work.

Chapter 2 Transformers and Attention
This chapter defines the essential parts of global self-attention inside the Transformer [15] and hopefully gives some intuition for the design choices. First, we need to introduce some terminology for the different entities required and then define Scaled Dot-Product Attention as the basic block of Multi-Head Attention. For the rest of the thesis, we will refer to global attention simply as attention for brevity. Before finally putting together the Transformer, we also go over normalizations, residual connections, and feed-forward layers in Section 2.4.
2.1 What is Attention?
Attention is a concept that is used in modern deep learning models. It is a tool that allows models to learn how different parts of a sequence relate, meaning that different parts can focus or attend to other parts of the data. We often speak of three different tensor entities called Query, Key and Value. Those entities can be obtained by projecting the input data (or input sequence) into different latent spaces, one for each entity. Attention can be modelled as a mapping function from a query and a set of key-value pairs to a real-valued output, which is a weighted sum of the values. To know how much the values contribute to the result, a compatibility function between the query and the keys is computed. We will use the Scaled Dot-Product Attention first introduced in Vaswani et al. [15].
2.2 Scaled Dot-Product Attention
For attention to express how much and where parts of the sequence are related, we need a compatibility function. One such function is the scaled dot-product. Let us denote the dimension of the query and key vectors as d_k and the dimension of the value vectors as d_v. The attention function is computed over multiple queries at once. The queries are packed into a single matrix Q ∈ ℝ^(n×d_k); the same applies for the keys with matrix K ∈ ℝ^(n×d_k) and the values with matrix V ∈ ℝ^(n×d_v), where n is the sequence length. Figure 2.1 shows a diagram of how to compute the scaled dot-product attention. The attention function first computes a dot product between all the queries and the transposed keys, followed by scaling with 1/√d_k and a softmax function. According to Vaswani et al. [15], this scaling is crucial to avoid small gradients when applying the softmax function. At this stage, we can apply a mask to constrain which values each position is allowed to see. Such a mask would only be required for training autoregressive models and is not necessary for our task. After applying the softmax function we get the weights matrix W, which can be thought of as the magnitude
of the contributions towards the result. The attention result is then the multiplication of these contributions with the values matrix V.
Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (2.1)

where softmax(Q K^T / √d_k) is the weights matrix W.
Figure 2.1: Diagram of the scaled dot-product attention. Consisting of a matrix multiplication between the queries Q and the keys K followed by a scaling and optional masking for autoregressive models. The result is run through a softmax layer and then multiplied with the values V . Figure adopted from Vaswani et al. [15]
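A minimal sketch of Equation (2.1) in PyTorch follows; the function name, dimensions and toy inputs are our own illustration, not the thesis code.

```python
# Scaled dot-product attention as in Equation (2.1).
import math
import torch


def scaled_dot_product_attention(q, k, v):
    """q, k: (n, d_k), v: (n, d_v); returns an (n, d_v) tensor."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # compatibility of every query with every key
    weights = torch.softmax(scores, dim=-1)            # the weights matrix W
    return weights @ v                                 # weighted sum of the values


q = k = v = torch.randn(6, 8)                          # a toy sequence of 6 tokens, d_k = d_v = 8
print(scaled_dot_product_attention(q, k, v).shape)     # torch.Size([6, 8])
```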
2.3 Multiple Heads
We usually call one attention block a head. Instead of using one attention head that operates in a single latent space, Vaswani et al. [15] suggest combining H different attention blocks at once. This process is called Multi-Head Attention (MHA). Using MHA they could increase their performance and show that the different projections of the queries, keys, and values into different latent spaces represent different features of the input sequence and attend differently. Thus, in MHA we have H learned linear projections W_i^Q, W_i^K, W_i^V. We can then perform attention over each representation h_i for i ∈ [1, H] in parallel:
h_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    ∀ i ∈ [1, H]    (2.2)
The computed values of every attention head h_1, h_2, ..., h_H are concatenated and run through a final projection W^O that combines all the intermediate attention results, as shown in Equation (2.3). Figure 2.2 visualizes the different operations involved in multi-head attention.
MHA(Q, K, V) = Concat(h_1, h_2, ..., h_H) W^O    (2.3)
Figure 2.2: Learning different representations by using H different scaled dot-product attentions on the linearly projected queries Q, keys K and values V. Figure adopted from Vaswani et al. [15]
2.4 Self-Attention Transformer
Figure 2.3: The Transformer Architecture consisting of stacked encoder layers that rely on the multi-head attention mechanism.
The Transformer architecture introduced in Vaswani et al. [15] consists of two components, the encoder and the decoder. For this thesis, we only use the encoder part of the network, since many encoder-only networks have been shown to perform well [17, 24, 28]. A picture of the encoder-only architecture is provided in Figure 2.3.
Encoder
One encoder block consists of two sublayers, the self-attention module and a feed-forward network, each followed by a residual connection and normalization. One can also do the normalization before the residual connection. To get the full encoder, one simply stacks several encoder blocks on top of each other.
2.4.1 Self-Attention
Self-attention is a simplified version of the general multi-head attention. Instead of the matrices Q, K, and V all being different from each other, we set them all to the same matrix X, which denotes our input to the model. Hence the attention function is:
SelfAttention(X) = MHA(X, X, X) (2.4)
It is crucial to note here that self-attention is permutation invariant without adding a positional encoding and needs further extension to work on spatial input!
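The following small check illustrates this property: without a positional encoding, permuting the input tokens only permutes the self-attention output in the same way, so the layer carries no notion of where a token sits. PyTorch's built-in multi-head attention is used here as a stand-in for the encoder's MHA; dimensions are illustrative.

```python
import torch

torch.manual_seed(0)
mha = torch.nn.MultiheadAttention(embed_dim=16, num_heads=4)
x = torch.randn(10, 1, 16)                      # (sequence, batch, features)
perm = torch.randperm(10)

out, _ = mha(x, x, x)                           # SelfAttention(X) = MHA(X, X, X)
out_perm, _ = mha(x[perm], x[perm], x[perm])    # same tokens, shuffled order

print(torch.allclose(out[perm], out_perm, atol=1e-6))  # True: the output is only permuted
```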
2.4.2 Feed Forward Network
The Transformer architecture [15] depicted in Figure 2.3 requires a Feed Forward network, whose purpose is to process the output of one attention layer in such a way that the next layers can benefit from it. The feed-forward network can be described by two composed linear transformations with an activation function fa in between, i.e. LeakyReLU.
FeedForward(x) = Conv2d(f_a(Conv2d(x)))    (2.5)
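A minimal sketch of such a feed-forward block, assuming 1×1 convolutions and LeakyReLU; the kernel size, hidden width and activation of the actual models may differ.

```python
import torch
from torch import nn


class FeedForward(nn.Module):
    """Two composed convolutions with an activation in between, cf. Equation (2.5)."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden_dim, kernel_size=1),   # first linear map, applied per pixel
            nn.LeakyReLU(),                              # activation f_a
            nn.Conv2d(hidden_dim, dim, kernel_size=1),   # project back to the model dimension
        )

    def forward(self, x):                                # x: (batch, dim, H, W)
        return self.net(x)
```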
2.4.3 Residual Connection & Normalization
The residual connection and normalization are tightly connected to the sublayers and can be deployed in two different modes. Let us denote the sublayer as a function F and the input signal at layer l as x_l. We speak of PreNorm if we apply normalization to the input of the sublayer. On the contrary, if we apply normalization after the residual connection we call it PostNorm. Which placement works best must be found out by experiment; for this thesis we stick to PostNorm using LayerNorm, as introduced in the original Transformer by Vaswani et al. [15].
Figure 2.4: Depiction of the different normalization placements. PreNorm applies the normalization to the input of the submodules, whereas PostNorm normalizes the signal after the residual connection. Figure adopted from Wang et al. [29].

Chapter 3 Structured Attention
One major issue of the attention introduced in Chapter 2 is the loss of spatial relations in the data due to flattening the input. For one-dimensional data this does not matter, since the sequence structure follows naturally. However, treating images or higher-dimensional data as sequences by flattening them into a one-dimensional tensor breaks spatial dependencies. Also, we want to emphasize again that the Transformer is permutation invariant and hence needs some extensions to handle spatial data. In this chapter, we motivate techniques for keeping the spatial relations intact during attention.
3.1 Axial Attention
A natural candidate for attention on high-dimensional data are axial Transformers, as introduced by Ho et al. [23] and later extended in Wang et al. [22]. Instead of applying global self-attention, these models compute attention over a single axis at a time. This has several advantages: First, we do not need to flatten the input tensor. Second, we benefit from lower computational and memory cost than vanilla self-attention, because the length of a single axis is usually much smaller than the total number of elements in the input tensor.
Figure 3.1: Schema of axial attention performed over each dimension separately. Image shows operations for a 2D image, where we compute attention first over the height followed by attention over the width. Figure adopted from Wang et al. [22]
Let us consider an image of N = H × W pixels. Axial attention has a computational and memory cost of O(W·H² + H·W²), because we first compute attention over W sequences of size H. Analogously, computing attention over the width involves H sequences, each of size W. This saves a factor of O(max(H, W)), or O(√N) for a square image, compared to standard self-attention. Generally, for a tensor of dimension d with equally large axes of size S and N = S^d pixels, we save a factor of O(N^((d−1)/d)).
For axial Transformers to work, we simply stack several layers of attention computed over all the available axes to gain a full receptive field. Hence, we inherently rely on the fact that, eventually, every pixel can propagate information to any other pixel, similar to graph attention [30]. In axial attention, every pixel can propagate its information after at most d hops, where d is the number of axes of the input tensor. A sample PyTorch implementation can be found on GitHub (https://github.com/lucidrains/axial-attention).
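A minimal sketch of axial attention over a 2D feature map, built from PyTorch's multi-head attention rather than the linked implementation; head count and dimensions are illustrative.

```python
import torch
from torch import nn


class AxialAttention2d(nn.Module):
    """Attention along the height axis followed by attention along the width axis."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.col_attn = nn.MultiheadAttention(dim, heads)
        self.row_attn = nn.MultiheadAttention(dim, heads)

    def forward(self, x):                                 # x: (batch, dim, H, W)
        b, d, h, w = x.shape
        # height axis: W sequences of length H per batch element
        cols = x.permute(2, 0, 3, 1).reshape(h, b * w, d)
        cols, _ = self.col_attn(cols, cols, cols)
        x = cols.reshape(h, b, w, d).permute(1, 3, 0, 2)
        # width axis: H sequences of length W per batch element
        rows = x.permute(3, 0, 2, 1).reshape(w, b * h, d)
        rows, _ = self.row_attn(rows, rows, rows)
        return rows.reshape(w, b, h, d).permute(1, 3, 2, 0)


x = torch.randn(2, 32, 8, 16)                             # (batch, channels, H, W)
print(AxialAttention2d(32)(x).shape)                      # torch.Size([2, 32, 8, 16])
```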
3.2 Global Self-Attention (GSA) Module
The Global Self-Attention (GSA) module was first introduced in Shen et al. [27]. It is based on the MHA mechanism from Chapter 2 and on axial attention, but introduces a few changes. Every output pixel is computed by combining spatial and content information from every input pixel. Let us denote the spatial dimensions of the data as W and H, as well as the input dimension d_in and output dimension d_out. The total number of pixels then is N = H·W. The input features can be defined as F^i ∈ ℝ^(WH×d_in), with the corresponding output features F^o ∈ ℝ^(WH×d_out). The module shown in Figure 3.2 consists of two parallel attention layers called Content Attention and Positional Attention. As the names suggest, the first layer aims to learn content-based attention maps, whereas the second layer learns features based on spatial positions.
Figure 3.2: Inner workings of a GSA module. First project input sequence into keys, queries and values. In parallel compute content and positional attention. Content attention combines the keys and values and is hence linear in the input sequence. Positional attention works like axial attention on the last two axes and also adds learned relative positional embeddings to columns and rows. Figure adopted from Shen et al. [27]
3.2.1 Content Attention
Like the standard attention in Chapter 2, we first project the input sequence onto keys K, queries Q, and values V by using a convolution with kernel size 1. We can then compute the output feature map F^c by first applying a softmax to each row of the transposed keys and then multiplying the result with the values, resulting in an intermediate tensor of size d_k × d_out. This has the huge benefit of avoiding the quadratic computational complexity when combining this tensor with the queries to obtain the final output feature map F^c. Hence the resources required by this layer are of order O(N).
F^c = Q ρ(K^T) V ∈ ℝ^(WH×d_out)    (3.1)

Shen et al. [27] state that this attention method can be thought of as first gathering the features in the value matrix V into d_k global context vectors using the weights ρ(K^T). These context vectors are then redistributed to the individual pixels using the weights in the query matrix Q. The authors also reported a significant performance drop of 1% in top-1 accuracy on ImageNet when using softmax normalization on the queries. One possible explanation is that normalizing the queries via a softmax constrains the output features to be a convex combination of the global context vectors, which might be too strong a constraint. Hence we will not make use of it in this thesis either.
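A minimal sketch of Equation (3.1) in plain tensor operations; the 1×1 convolution projections are omitted and the dimensions are illustrative.

```python
import torch

N, d_k, d_out = 1024, 16, 32
Q = torch.randn(N, d_k)                   # queries, one row per pixel
K = torch.randn(N, d_k)                   # keys
V = torch.randn(N, d_out)                 # values

context = torch.softmax(K, dim=0).T @ V   # rho(K^T) V: d_k global context vectors of size d_out
F_c = Q @ context                         # redistribute the context to pixels; no N x N matrix is formed
print(F_c.shape)                          # torch.Size([1024, 32])
```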
3.2.2 Positional Attention
This additional layer is needed to incorporate spatial information into the attention, since the content attention layer is equivariant under pixel shuffles. Positional attention is inspired by axial attention and first computes attention over the columns only, followed by an optional batch normalization and a final attention layer over the rows. But instead of only computing the attention, we also apply a learned relative positional embedding in each step. This learned embedding is based on a parameter L, which defines the size of a local neighbourhood around each pixel. Hence, every output pixel is computed by combining contributions of every pixel in a relative neighbourhood of size L × L. Given L, the neighbourhood can be defined by a set of fixed offsets Δ = {−(L−1)/2, ..., 0, ..., (L−1)/2}. We now introduce the learned positional embeddings R^c ∈ ℝ^(L×d_k) and describe the output of the positional attention layer after one column-wise pass, since the row attention part is done analogously. For this we also need to define a matrix V^c_(i,j) ∈ ℝ^(L×d_out) that denotes the L neighbouring pixel values in the same column as pixel (i, j). Additionally, let the query for pixel (i, j) be q_(i,j). Then the output of the positional attention f^c_(i,j) can be written as follows:
f^c_(i,j) = (q_(i,j) R^(cT)) V^c_(i,j)    (3.2)
The module allows us to apply a batch normalization between the column and row attention, which we did not use in our experiments. Let us call the output of the optional batch normalization b^c_(i,j). The row attention part is then defined analogously to Equation (3.2):
f^r_(i,j) = (b^c_(i,j) R^(rT)) V^r_(i,j)    (3.3)
As we have seen in Section 3.1, this layer requires O(N·√N) in terms of memory and computation.

Chapter 4 Data and Baselines
The performance of a deep learning model is tightly coupled to the dataset and training procedure that have been used. This chapter explains where our data comes from and what data collection we use for training. Further, we introduce the WeatherBench benchmark that serves as a comparison foundation for models on weather prediction; it also defines a latitude-weighted root mean squared error metric to score each model. Additionally, we elaborate on the transformations applied to the data that benefit the training process and highlight known problems and open data issues.
4.1 ERA5
The data used to train our models is provided by the European Center for Medium-Range Weather Forecasts (ECMWF). It is based on their fifth generation reanalysis data set ECMWF Reanalysis (ERA5) [31] generated using 4D-Var data assimilation and model forecasts in CY41R2 of ECMWF’s Integrated Forecast System (IFS). The weather data spans from 1979 to 2018 and contains hourly field estimates of over 300 parameters available at 137 pressure levels.
To be aligned with the baselines provided in the WeatherBench, we will only use a subset of the ERA5 data. The WeatherBench repository provides additional information on where and how to download the hosted data. Using this dataset, we make the basic assumption that any meteorological biases in weather patterns over the years 1979-2018 are negligible.
4.2 WeatherBench
WeatherBench is a benchmark data set for data-driven weather forecasting that has been published in [32], with the goal of laying a foundation for new data-driven methods. The data set is based on ERA5 data and offers three resolutions: 1.40625°, 2.8125° and 5.625°. WeatherBench only covers a subset of all the ERA5 variables and pressure levels. A list of all the contained variables of WeatherBench is provided in Table 4.2. The baselines include physical simulations and recent deep learning results run on the same data, shown in Table 4.1.
The data repository provides evaluations on z500 (geopotential at 500 hPa), t850 (temperature at 850 hPa), t2m (temperature at 2m above ground) and pr (precipitation) for 3 and 5 days lead time. For this work we will focus only on predicting z500, t850 and t2m.
First we have to define the root mean squared error (RMSE). Let the output be of size N_lat × N_lon, with N_lat latitude and N_lon longitude grid points, let P ∈ ℝ^(N_forecasts×N_lat×N_lon) be the model's prediction, and let G ∈ ℝ^(N_forecasts×N_lat×N_lon) be the ground truth over N_forecasts forecasts. The RMSE can be defined as:
RMSE(P, G) = (1/N_forecasts) Σ_i √[ (1/(N_lat·N_lon)) Σ_j Σ_k (P_(i,j,k) − G_(i,j,k))² ]    (4.1)
The scores in the benchmark have been computed using a latitude-weighted RMSE (RMSE_lat) metric. The weighting is needed to account for the different sizes of the grid cells: an even weighting would result in too much emphasis on the poles, since the grid points near the poles are much denser. The metric is defined as follows:
RMSE_lat(P, G) = (1/N_forecasts) Σ_i √[ (1/(N_lat·N_lon)) Σ_j Σ_k L(j)·(P_(i,j,k) − G_(i,j,k))² ]    (4.2)

L(j) is the mentioned weighting factor depending on the latitude index j and can be formulated as
L(j) = cos(lat_j) / ( (1/N_lat) Σ_(j') cos(lat_(j')) )    (4.3)
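A minimal sketch of Equations (4.2) and (4.3); the argument shapes mirror the definitions above, while the function itself is our own illustration rather than the WeatherBench evaluation code.

```python
import numpy as np


def lat_weighted_rmse(pred, truth, lats_deg):
    """pred, truth: (N_forecasts, N_lat, N_lon); lats_deg: (N_lat,) latitudes in degrees."""
    w = np.cos(np.deg2rad(lats_deg))
    w = w / w.mean()                                    # L(j), normalized to mean 1
    sq_err = w[None, :, None] * (pred - truth) ** 2     # weight each latitude row
    return np.sqrt(sq_err.mean(axis=(1, 2))).mean()     # spatial mean, sqrt, then mean over forecasts
```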
4.2.1 Baselines

The WeatherBench baselines listed in Table 4.1 all used the same data. They were run at the same resolution of 5.625°, except for the numerical weather models, which were run at coarser resolutions. The baselines can be categorized into four different types of models.
Table 4.1: Table contains NWP models run at different resolutions. Additionally it shows baselines and recent deep learning models scored on the same data contained in WeatherBench using the RMSElat metric on 5.625° resolution.
Model                                   z500 (3/5 d)    t850 (3/5 d)   t2m (3/5 d)
Persistence                             936 / 1033      4.23 / 4.56    3.00 / 3.27
Weekly Climatology                      816             3.50           6.07
IFS T42                                 489 / 743       3.09 / 3.83    3.21 / 3.69
IFS T63                                 268 / 463       1.85 / 2.52    2.04 / 2.44
Operational IFS                         154 / 334       1.36 / 2.03    1.35 / 1.77
UNet Weyn et al. [33]                   373 / 611       1.98 / 2.87    -
ResNet19 Direct (ERA only) [2]          314 / 561       1.79 / 2.82    1.53 / 2.32
ResNet19 Direct (pretrained) [2]        268 / 523       1.65 / 2.52    1.42 / 2.03
ResNet19 Continuous (ERA only) [2]      331 / 545       1.87 / 2.57    1.60 / 2.06
ResNet19 Continuous (pretrained) [2]    284 / 499       1.72 / 2.41    1.48 / 1.92
Persistence is one of the simplest forecasting models, which assumes that the current weather configuration persists over the following days.
Weekly Climatology is obtained by simply averaging all the different weeks over time. The computed averages of the weeks contain the seasonal cycles and outperform the Persistence baseline.
Numerical Weather Prediction IFS (Integrated Forecast System) models are currently the standard for medium-range numerical weather predictions. Such models include the operational IFS model of ECMWF and IFS models run at coarser resolutions. IFS T42 uses a resolution of roughly 2.8° and 62 vertical levels, whereas IFS T63 operates at a resolution of roughly 1.9° and 137 vertical levels. These coarser models were computed to provide a means of comparison in terms of the computational resources a deep learning model might have.
Deep Learning The set of baselines also includes recent deep learning models scored on the WeatherBench dataset, like the U-Net of Weyn et al. [33]. The authors reduce the distortions of the data, which has been projected onto a 2D grid, by mapping the grid onto a cubed sphere. The currently best data-driven model for both direct and continuous predictions, by Rasp and Thuerey [2], uses a pretrained ResNet19.
Table 4.2: Table of variables provided in the WeatherBench dataset. Some variables are available at 13 different pressure levels, others are only available on a single pressure level or are constants. Data and description from WeatherBench [32].
Variable Name              Symbol   Description                                   Unit              Levels
Temperature                t        Temperature                                   [K]               13
Geopotential               z        Proportional to height of a pressure level    [m² s⁻²]          13
Relative humidity          r        Humidity relative to saturation               [%]               13
Specific humidity          q        Mixing ratio of water vapor                   [kg kg⁻¹]         13
Eastward wind              u        -                                             [m s⁻¹]           13
Northward wind             v        -                                             [m s⁻¹]           13
Vorticity (relative)       vo       Relative horizontal vorticity                 [s⁻¹]             13
Potential vorticity        pv       Potential vorticity                           [K m² kg⁻¹ s⁻¹]   13
2m temperature             t2m      Temperature                                   [K]               1
10m u wind component       u10      -                                             [m s⁻¹]           1
10m v wind component       v10      -                                             [m s⁻¹]           1
Total precipitation        tp       Hourly precipitation                          [m]               1
Total cloud cover          tcc      Fractional cloud cover                        [0, 1]            1
Incoming solar radiation   tisr     Accumulated hourly incident solar radiation   [J m⁻²]           1
Orography                  oro      Height of surface                             [m]               1
Land sea mask              lsm      Land-sea binary mask                          [0, 1]            1
Soil type                  slt      Soil-type categories                          -                 1
Latitude                   lat2d    2D field with latitude at every grid point    [°]               1
Longitude                  lon2d    2D field with longitude at every grid point   [°]               1
4.3 Normalization
A first look at a partial histogram excerpt of the data in Figure 4.1 reveals that not all variables follow a normal distribution. Z-normalization rescales the data to zero mean and unit variance, but the resulting distribution is not necessarily normal. Since Rasp and Thuerey [2] achieved good scores using that normalization scheme, we keep it for our experiments as well.
To prevent information leakage into the test and validation data, the normalization statistics were computed using only the training data. Let us denote the input tensor as D ∈ ℝ^(ch×pl×h×w). We provide the following data normalizations:
Z-Normalization (Standardization)
The Z-normalization, often simply referred to as standardization, is a tool that puts the data onto the same scale. By doing so, we standardize the distribution to zero mean and variance 1. Thus the required terms we need to compute are the mean D_μ ∈ ℝ^(ch×pl) and the standard deviation D_σ ∈ ℝ^(ch×pl) per pressure level and channel over all desired time steps. The transformed distribution still has the same shape, but its values and range are scaled, which can be seen in Figure 4.2. The normalized tensor D̂ ∈ ℝ^(ch×pl×h×w) then is:
D̂ = (D − D_μ) / D_σ    (4.4)
Figure 4.2: Histogram of the distribution of t850 after standardization. The distribution still has the same shape but is a rescaled version of the original t850 distribution.
Min-Max Scaling
Min-Max scaling involves finding the minimum D_min ∈ ℝ^(ch×pl) and maximum D_max ∈ ℝ^(ch×pl) per pressure level and channel and obtaining the normalized tensor D̂ ∈ ℝ^(ch×pl×h×w) via Equation (4.5). Min-Max scaling can be thought of as a special case of standardization with mean μ = D_min and σ = D_max − D_min. It has the useful property that the normalized values are always in the range [0, 1] for training. A sample histogram of the distribution after this normalization is shown in Figure 4.3.
D̂ = (D − D_min) / (D_max − D_min)    (4.5)
Figure 4.3: Histogram of the distribution of t850 after Min-Max normalization. Note that the x-axis is rescaled but the distribution shape stays the same.
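A minimal sketch of the two global schemes above, with statistics computed on the training split only; for brevity the tensors here are laid out as (time, channel, h, w), with the pressure levels already folded into the channel dimension.

```python
import torch


def standardize(train, x):
    """Equation (4.4): zero mean and unit variance per channel, statistics from the training data."""
    mu = train.mean(dim=(0, 2, 3), keepdim=True)
    sigma = train.std(dim=(0, 2, 3), keepdim=True)
    return (x - mu) / sigma


def min_max_scale(train, x):
    """Equation (4.5): rescale to [0, 1] using the per-channel training minimum and maximum."""
    lo = train.amin(dim=(0, 2, 3), keepdim=True)
    hi = train.amax(dim=(0, 2, 3), keepdim=True)
    return (x - lo) / (hi - lo)
```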
Local Area Standardization (LAS)
One drawback of the above-mentioned normalization schemes is that they normalize every pixel equally, independent of its location. This can cause very local changes to get lost. However, computing a separate normalization for every pixel does not work well either, because the normalized values might no longer be locally coherent and can become non-smooth. Local Area Standardization, by Grönquist et al. [34], is a mixture of both ideas and computes the mean and standard deviation of every pixel inside a window of fixed size over time. The first step in computing this norm is padding the input: we pad periodically in the longitude direction and pad in the latitude direction by repeating the last value at the border. We then apply two filters of size k × k over each channel, which compute the mean and the standard deviation inside this window. For our experiments we set the filter size to k = 5. Finally, we apply a periodic padding (with zero padding in the latitude direction) to the data and apply a Gaussian filter with distribution N(0, 5) to both outputs. Figure 4.4 provides a visualization of this procedure. The Gaussian blur is not mandatory; it smooths out rapidly changing values but might also tamper with the original signal.
We hope to take some work off the model by providing a normalization that already captures spatially local differences in the per-pixel mean and standard deviation. Figure 4.5 shows the computed standardization statistics (mean and standard deviation) for the variable t850.
Figure 4.4: Visualization of Local Area Standardization by computing moving mean and moving standard deviation over each channel. Finally the intermediate result is padded and we run a Gaussian filter over it. Figure from Grönquist et al. [34]
Figure 4.5: Mean and standard deviations of LAS computed for variable t850 for years 1979-2015 of ERA5.
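A rough sketch of how such per-pixel statistics over a spatial window and the training period can be computed with pooling operations; the padding details are simplified, the final Gaussian blur is omitted, and the exact procedure of Grönquist et al. [34] may differ.

```python
import torch
import torch.nn.functional as F


def local_area_stats(x, k=5):
    """x: (T, 1, H, W) training fields of one channel.
    Returns per-pixel mean and standard deviation over a k x k window and all T time steps."""
    p = k // 2

    def pad(t):
        t = F.pad(t, (p, p, 0, 0), mode="circular")      # wrap around in longitude
        return F.pad(t, (0, 0, p, p), mode="replicate")  # repeat the border value in latitude

    m1 = F.avg_pool2d(pad(x.mean(dim=0, keepdim=True)), k, stride=1)          # E[x]
    m2 = F.avg_pool2d(pad((x ** 2).mean(dim=0, keepdim=True)), k, stride=1)   # E[x^2]
    std = (m2 - m1 ** 2).clamp_min(1e-12).sqrt()
    return m1, std
```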
Figure 4.1: Histogram on partial data including variables z, t, t2m, u, v, q on pressure levels 850 hPa, 500 hPa and 100 hPa. The last row shows variables t2m, tcc and tp that are only available on one pressure level.
4.4 Data Subset
For our experiments in Chapter 6 we only used the subset of the data specified in Table 4.3. To the existing variables, we add a new variable aggr_tp, which stands for the 6-hour aggregated total precipitation, as recommended in Rasp and Thuerey [2].
Table 4.3: List of all the variables contained in the subset that are used for training.
Variable Name              Symbol
Temperature                t
Geopotential               z
Specific humidity          q
Eastward wind              u
Northward wind             v
Potential vorticity        pv
2m temperature             t2m
Total precipitation        tp
Total cloud cover          tcc
Incoming solar radiation   tisr
Orography                  oro
Land sea mask              lsm
Soil type                  slt
Latitude                   lat2d
4.5 Transformations
Log-Transform Precipitation
As we can see in Figure 4.1, the distribution of the total precipitation tp is quite skewed, and taking the 6-hour accumulation does not improve this. Following the suggestion from [2], we also log-transform the distribution of the 6-hour accumulated aggr_tp with ε = 0.001, as follows:
aggr_tp̂ = log(ε + aggr_tp) − log(ε)    (4.6)
Subtracting log(ε) ensures that an initial value of zero remains zero after the transform. The resulting distribution can be seen in Figure 4.6. Note that even after this transformation, the 6-hour accumulated total precipitation is still the variable with the most skewed distribution.
Figure 4.6: Histogram of 6-hour accumulated total precipitation aggr_tp on the left and the log-transformed distribution on the right.
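A minimal sketch of Equation (4.6); the ε value follows the text above, the function name is ours.

```python
import numpy as np

eps = 0.001


def log_transform_precip(aggr_tp):
    return np.log(eps + aggr_tp) - np.log(eps)   # zero precipitation stays exactly zero

print(log_transform_precip(np.array([0.0, 0.001, 0.01])))
```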
Variable Data Shape
We need to address the issue that not all data has the same shape. Many variables, such as t2m, tp or the wind components at 10 m, are only available on one pressure level. Inspired by the paper of Rasp and Thuerey [2], we combine the pressure-level and variable dimensions into one dimension, which we call plvars. For instance, the variable z of shape (time, var, pl, h, w) is transformed into shape (time, plvars, h, w), where time denotes the number of time slices we give the model during training. We picked three time slices (time = 3) at -12 h, -6 h, and 0 h for the model to train on. At this point we have the option to leave the time dimension untouched or to stack it along the plvars dimension as well, as sketched below. For the sake of this thesis we stick to stacking the time slices along the plvars dimension, but we also provide one experiment on future work in Section 6.4. The implications of doing this are discussed in Section 6.7.
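A minimal sketch of this reshaping; the variable and pressure-level counts are illustrative and do not match our data subset exactly.

```python
import torch

time, var, pl, h, w = 3, 6, 11, 32, 64
x = torch.randn(time, var, pl, h, w)

x = x.reshape(time, var * pl, h, w)       # combine variables and pressure levels into plvars
x = x.reshape(1, time * var * pl, h, w)   # additionally stack the three time slices along plvars
print(x.shape)                            # torch.Size([1, 198, 32, 64])
```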
4.6 Known Dataset Inconveniences
Missing Data
Because of how the ERA5 dataset has been generated, the first six time slices contain NaNs in the variables tp and tisr. Additionally, since we compute the 6-hour accumulated precipitation, three more entries cannot be used. This leaves the first nine time steps with NaNs in tp and tisr, which we simply discard before training.
Cold Bias 2000-2006
The Copernicus Climate Change Service (C3S) at ECMWF has published a new version of ERA called ERA5.1 [35], which addresses the issue of a cold bias that was noticed in lower parts of the stratosphere for the years 2000-2006 in ERA5. Using the new dataset, we can expect a more accurate representation of temperature and humidity at lower heights.
ERA5 CDS Data Corruption
Unfortunately, three weeks before the end of this thesis, the ECMWF announced that they found 361 damaged fields out of 3.1 billion available fields in the data¹. This means that, on average, 1 out of every 8.6 million fields is corrupted. The corruption can be seen in Figure 4.7 and is noticeable as a horizontal line whose values are the minimum value of the field. Every user who downloaded the data prior to 2021-04-15 is affected, and this also affects the WeatherBench benchmark.
Figure 4.7: Example image of corrupted data inside ERA5 for some fields. Image taken from [36].
¹ Information published on https://confluence.ecmwf.int/display/CKB/ERA5+CDS%3A+Data+corruption

Chapter 5 Models
This chapter describes the model architectures chosen for our experiments in Chapter 6. We first explain the common network architecture and the custom Periodic Padding layers, since they are the same in all specific model architectures. We then shift our main focus onto the GSA-(Res)Net architecture in Section 5.2, which mixes ResNet blocks and GSA blocks. It is parameterized so that we can obtain a ResNet-only network or a GSA-only network from it, which will be useful during the comparisons between the different models.
5.1 General Network Architecture
The general network architecture is depicted in Figure 5.1. It consists of a periodic padding followed by a 7×7 Conv2d that reduces the number of channels to the desired hidden dimension, e.g. dim_hidden = 128. We then apply a batch norm, activation function f_act and dropout, in this order, before feeding the result to the network at hand. We again apply a periodic padding to the processed network output and reduce it to three output channels.
2D Positional Encoding
This step is inspired by the positional encoding used in the Transformer architecture [15], whose purpose is to break the permutation invariance of the attention operation. Right after reducing the channels to hidden_dim channels, we add a fixed encoding PE to the signal. This should help the model to spatially differentiate pixels (x, y) on different channels and overcome the permutation-invariance property of the Transformer. Let i, j ∈ [0, D/4), where D = hidden_dim. The equations for the positional encoding are provided in Equations (5.1)-(5.4). A sample implementation can be found on GitHub (https://github.com/tatp22/multidim-positional-encoding).
PE(x, y, 2i)                      = sin(x / 10000^(4i/hidden_dim))    (5.1)
PE(x, y, 2i + 1)                  = cos(x / 10000^(4i/hidden_dim))    (5.2)
PE(x, y, 2j + hidden_dim/2)       = sin(y / 10000^(4j/hidden_dim))    (5.3)
PE(x, y, 2j + 1 + hidden_dim/2)   = cos(y / 10000^(4j/hidden_dim))    (5.4)
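A minimal sketch of Equations (5.1)-(5.4); the linked reference implementation may differ in layout, and the (channels, H, W) output shape is our own choice.

```python
import torch


def positional_encoding_2d(hidden_dim, height, width):
    """Fixed 2D sinusoidal encoding: first half of the channels encodes x, second half encodes y."""
    pe = torch.zeros(hidden_dim, height, width)
    d = hidden_dim // 2
    div = 10000 ** (torch.arange(0, d, 2) / d)        # 10000^(4i / hidden_dim)
    x = torch.arange(width).float()                   # longitude index
    y = torch.arange(height).float()                  # latitude index
    pe[0:d:2] = torch.sin(x[None, None, :] / div[:, None, None])
    pe[1:d:2] = torch.cos(x[None, None, :] / div[:, None, None])
    pe[d::2] = torch.sin(y[None, :, None] / div[:, None, None])
    pe[d + 1::2] = torch.cos(y[None, :, None] / div[:, None, None])
    return pe                                         # added to the hidden_dim feature maps


print(positional_encoding_2d(128, 32, 64).shape)      # torch.Size([128, 32, 64])
```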
Embedding
The original Transformer architecture by Vaswani et al. [15] in Section 2.4 requires an input embedding that has not yet been addressed in this thesis. For Transformers on NLP or image tasks, the values can be thought of as tokens and can be embedded in a latent space. Since we have many values of different meanings and scales, e.g. temperature, geopotential or precipitation, it is not obvious to us how to work around this. Also, adding an extra dimension while keeping the channel dimension intact changes the shape of the data to (batch, embedding, ch, H, W). This would force us to either stack the embedding and channel dimensions or use 3D convolutions for the ResNet and channel-reduction parts. We are unsure if this is the way to go and leave this point open. For the sake of this thesis we treat the channel dimension as our embedding dimension, which has the benefit of having fewer parameters and having a similar ResNet architecture and data shape as in [2] for comparisons. In our models we can think of the learned filters of the ResNet as our embedding, or treat the whole encoder stack of the Transformer as one.
Figure 5.1: Depiction of the individual modules involved in our common model architecture. Input is periodically padded and its channels reduced to hidden_dim=128, followed by a fixed positional encoding. The intermediate result is then fed to the network, whose result is again padded and reduced to three output channels.
Activation Functions
Our experiments use one of the following activation functions. Both LeakyReLU [37] and PReLU [38] are standard activation functions commonly used in CNNs. The difference between them is that PReLU involves a scaling factor α_i that can be learned per position i. If α_i = c for all i for some constant c, then PReLU is equal to LeakyReLU.
LeakyReLU(z) = z if z > 0, and αz otherwise    (5.5)

PReLU(z_i) = z_i if z_i > 0, and α_i·z_i otherwise    (5.6)
Periodic Padding
As proposed in [2], we apply a periodic padding in the longitude direction and zero-pad the input in the latitude direction. This should help the model to learn that the data wraps around the globe in the longitude direction. Figure 5.2 shows the padding procedure for pad size 5 on a randomly picked temperature slice.
Figure 5.2: Unpadded input (left) and padded input (right) for pad size 5.
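A minimal sketch of this padding, assuming circular padding along longitude and zeros along latitude; the pad size and tensor layout are illustrative.

```python
import torch
import torch.nn.functional as F


class PeriodicPadding2d(torch.nn.Module):
    def __init__(self, pad):
        super().__init__()
        self.pad = pad

    def forward(self, x):                                           # x: (batch, ch, lat, lon)
        x = F.pad(x, (self.pad, self.pad, 0, 0), mode="circular")   # wrap around in longitude
        return F.pad(x, (0, 0, self.pad, self.pad), value=0.0)      # zeros in latitude


x = torch.randn(1, 1, 32, 64)
print(PeriodicPadding2d(2)(x).shape)   # torch.Size([1, 1, 36, 68])
```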
5.2 GSA-(Res)Net Forecasters
Our network architecture, shown in Figure 5.3, aims at combining two ideas: augmenting an established network with attention, and completely replacing convolutions with attention. To achieve this flexibility we introduce two parameters k and r, which determine the number of ResNet blocks r followed by k GSA blocks. Obviously, if we set k = 0, r = 28, then we get our adaptation of a conventional ResNet28. Note that, as of now, our architecture does not allow GSA blocks in the middle of a ResNet, i.e. a sequence such as ResNet14 - GSA8 - ResNet14. This is a future thought that will be discussed in Section 8.2. For the sake of this thesis we investigate the following configurations:
• k = 0, r = 28: ResNet28

• k = 8, r = 20: ResNet28 with the last 8 blocks replaced by GSA blocks.

• k = 8, r = 0: Network with 8 GSA blocks.

• k = 28, r = 0: Network with 28 GSA blocks.
We also want the model to be able to disregard any contribution of the Transformer, which we achieve by wrapping the Transformer part of the network inside a residual connection. This way we can find out how much and where the Transformer contributes to the prediction compared to the ResNet backbone, if it has one.
Figure 5.3: Detailed view of network part (green block) in Figure 5.1. The model consists of r ResNet blocks followed by k GSA blocks, all wrapped inside a skip connection to eventually cancel out the Transformer altogether.
Hence the network has two distinct network parts, whose blocks need further explaining.
5.2.1 Adapted ResNet Block
Our adapted ResNet block consists of two sequential convolutional sub-blocks. Each applies a periodic padding in the longitude direction before running the input through the convolution. A batch normalization and activation function follow, after which we apply dropout and finally add the result back to the original signal. A visualization of such a block can be found in Figure 5.4.
Figure 5.4: Schema of an adapted ResNet block that simply wraps two sequential convolutional blocks inside a residual connection. Stacking these adapted blocks enough times yields the yellow network part in Figure 5.3.
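A minimal sketch of such a block; channel count, dropout rate and activation are illustrative choices, and the padding is folded into a small helper module rather than a separate layer.

```python
import torch
import torch.nn.functional as F
from torch import nn


class PadConv(nn.Module):
    """3x3 convolution preceded by circular padding in longitude and zero padding in latitude."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3)

    def forward(self, x):
        x = F.pad(x, (1, 1, 0, 0), mode="circular")   # wrap around in longitude
        x = F.pad(x, (0, 0, 1, 1))                    # zeros in latitude
        return self.conv(x)


def conv_stage(channels, dropout=0.0):
    return nn.Sequential(PadConv(channels), nn.BatchNorm2d(channels),
                         nn.LeakyReLU(), nn.Dropout2d(dropout))


class AdaptedResNetBlock(nn.Module):
    """Two (pad, conv, batch norm, activation, dropout) stages inside one residual connection."""
    def __init__(self, channels=128):
        super().__init__()
        self.body = nn.Sequential(conv_stage(channels), conv_stage(channels))

    def forward(self, x):
        return x + self.body(x)
```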
5.2.2 GSA Block
The GSA block contains the basic GSA module presented in Section 3.2, but we also enhance the standard GSA module by adding a Latitude Weighting and an Affine Weighting for more flexibility and to help guide the model's attention. Both custom layers are optional and are denoted in round brackets in Figure 5.5.
Figure 5.5: Picture of the GSA block, which consists of the GSA module introduced in Section 3.2 enhanced with a latitude weighting layer and an affine layer.
Latitude Weighting W_lat
Distortions due to the projection of the data onto the 2D grid cannot be avoided. Although we can train our networks with a loss function that weights the errors accordingly, i.e. L1_lat or MSE_lat, the attention mechanism might not know that it should attend less to pixels near the poles. Our idea is to weight the computed attention by the latitude areas, similar to what has been done in Equation (4.2).
Let F^attn ∈ ℝ^(ch×N_lat×N_lon) denote the sum of the content attention and positional attention outputs. We can then multiply each channel by the latitude weights from Equation (4.3). Before multiplying by the latitude weights we need to reshape the data such that the latitude data points form the last dimension; afterwards we permute the tensor back to its original shape. For simplicity we only provide the formula on the already permuted tensor F̂^attn ∈ ℝ^(ch×N_lon×N_lat).
F̂^lat_(c,i,j) = F̂^attn_(c,i,j) · L(j)    ∀ j    (5.7)
Affine Layer W_affine
This layer can be thought of as a generalization of the latitude weighting. Instead of constraining the model to a fixed set of weights, we let the model learn a set of affine weights for each channel and pixel. Let F^o ∈ ℝ^(ch×N_lat×N_lon) be the result of a GSA module, and let the affine layer W_affine be described by a weights matrix A ∈ ℝ^(ch×N_lat×N_lon) and a bias matrix B ∈ ℝ^(ch×N_lat×N_lon). The output can then be defined as follows:
AffineLayer(F^o) = A ⊙ F^o + B    (5.8)

where ⊙ denotes element-wise multiplication.

Chapter 6 Experiments on WeatherBench
The main focus of this chapter is the results of our experiments in Section 6.4. We compare our models to existing baselines and also investigate the performance on three handpicked weather events of the last 30 years, e.g. hurricane Katrina in 2005. We provide visualisations of the learned embeddings and attention heads for each event to determine what the model is focusing on. We are also interested in the stability of the models, i.e. how they behave under slight changes to the input, which is addressed in Section 6.6. Before diving into the numbers we want to explain as precisely as possible how the experiments are set up in Section 6.1 and also define how our models predict the state of the weather into the future in Section 6.2. It is also essential that we shed light on how the models were trained, which is done in Section 6.3. Finally, we discuss the obtained results in Section 6.7.
6.1 Setup
Hardware
Due to the large number of experiments and long training times, our experiments were carried out on two different nodes provided by CSCS: node ault06 with 4× NVIDIA A100 SXM4 40 GB GPUs, 128 AMD EPYC 7742 64-core processor cores and 512 GB of shared memory, and node ault25 with 4× NVIDIA V100 32 GB GPUs, 72 AMD EPYC 7742 64-core processor cores and 726 GB of shared memory.
Data Loading
We use PyTorch’s [39] DataLoader class for facilitating efficient and simple data loading. The original xarray dataset is 304 GB and does not fit into memory on most machines. We have spent a lot of time in the first third of this thesis to find the fastest strategy to load the data. The options included having one NumPy file for each day (300k+ files), loading from xarray directly or having single big NumPy file for train, validation and test data. Unfortunately, loading the data directly from xarray [40] as well as splitting the dataset into many NumPy files was too slow, due to the overhead of the file loads. It worked best to convert the different data subsets (train, validation, test) to one big, memory-mapped file, loaded to shared memory before training. The converted dataset amounts to roughly 132 GB and nicely fits into memory on all nodes. The size reduction comes from using only a subset of the data (Section 4.4)andstoringitinNumPy data format. Since we will enable data shuffling inside the DataLoader during training, the accessed indices are random. With NumPy’s option to load a file in memory-mapped mode, random accesses remain
fast, since the whole dataset resides in memory. This limits us to using PyTorch's DataParallel to distribute the load evenly among the available GPUs, as DistributedDataParallel would load the whole dataset once per process and overflow the memory.
Runs, Logs and Checkpoints
All the runs and logs, including plots during training, were managed with Weights & Biases (W&B) [41]. To make sure we have a copy of the checkpoints even in the worst case, we not only keep a local copy of the best checkpoints, but also upload the training script and model code, including the best and latest checkpoint, to W&B. The best and latest checkpoints are updated every epoch and every 800 iterations. The best weights are also available through our repository¹.
Since we want to have the freedom to try out different normalizations and not generate a new data set per normalization option, we normalize the data before feeding it into the model. This flexibility incurs a small performance loss due to the computational overhead of the normalization.
6.2 Predictions
Prediction Type

There are three prediction types:
1. Direct: The model is trained and evaluated to always predict the state at exactly t + Δt, where Δt represents the lead time, e.g. Δt = 3 days.

2. Continuous: The model is trained to predict the state at t + Δt, but the lead time Δt is given as an input parameter. Ideally, the model learns to do predictions for lead times in the range [0, Δt_max] for a randomly chosen Δt. During evaluation, we can then fix Δt to a desired value in the range the model has been trained on (e.g. 3 days).

3. Iterative: The models are trained to predict the state for smaller time steps, e.g. Δt = 1 hour into the future. During evaluation, one repeatedly applies the model to the intermediate states to reach the desired forecast time, as in a standard autoregressive setting.
Which prediction mode to use depends on the task at hand and which properties we want to exploit. One obvious drawback of the direct approach compared to continuous and iterative models is that one model per desired lead time must be trained. However, direct models can learn more specific characteristics for the given lead time. Non-direct models have the advantage that arbitrary forecast times can be chosen, but certain limitations exist. Due to the feedback-like nature of iterative models, like the one in Weyn et al. [33], errors can propagate very rapidly and may lead to huge errors over time. Similarly, if one predicts beyond the training forecast range of a continuous model, we might get large errors. It is therefore a trade-off between specificity, generalization and numerical stability. For the sake of this thesis we constrain ourselves to continuous forecasts to remain flexible concerning lead time, stability and evaluation time.
¹ https://spclgitlab.ethz.ch/deep-weather/structured-weather-transformer
6.3 Training
The models were trained with PyTorch [39] version 1.7.1.
Loss Functions
We complement the loss functions already available in PyTorch with the functions below. Let N_lat denote the number of latitude points on the grid and let w_lat ∈ ℝ^(N_lat) denote the latitude weights. We then define the model prediction and ground truth to be P ∈ ℝ^(N_forecasts×C×N_lat×N_lon) and G ∈ ℝ^(N_forecasts×C×N_lat×N_lon) over N_forecasts forecasts and C output channels. The following loss functions give the resulting loss on a per-channel basis c. If we want a scalar loss, we simply aggregate over the channels by taking the mean.
L1_lat(P, G, c) = (1/N_forecasts) Σ_k [ (1/(N_lat·N_lon)) Σ_i Σ_j |P_(k,c,i,j) − G_(k,c,i,j)| · w_lat,i ]    (6.1)
MSE_lat(P, G, c) = (1/N_forecasts) Σ_k [ (1/(N_lat·N_lon)) Σ_i Σ_j (P_(k,c,i,j) − G_(k,c,i,j))² · w_lat,i ]    (6.2)
Optimizer
During all experiments we use the Adam optimizer with betas = (0.9, 0.98) and a weight decay factor of 1 × 10⁻⁵. The same amount of weight decay is applied to every layer of the network.
Mixed Precision
We were unsuccessful in using the native mixed precision offered by PyTorch to increase our performance further. The issue could be tracked down to the last convolutional layer inside the GSA module, which maps the intermediate dimensions to the output dimensions. We are not completely sure what exactly happens, but the computation in float16 eventually produces NaNs. This issue was not investigated further due to time constraints.
Batches, Shuffle, Split
We use the same data split proposed in Rasp et al. [42], which separates the data into the following three non-overlapping sets:
• Train Set: Covers the years [1979,2016).
• Validation Set: Covers the year [2016, 2017).

• Test Set: Covers the years [2017, 2019).
We also perform random shuffling of the data before handing it to the model. This is only done during training. Upon evaluation, the order of the data is preserved, such that the visualizations and plots are always computed on the same days for the different models.
Initialization
We initialize the weights of Conv2d layers with Xavier uniform [43] initialization and set the bias to a constant 0. Here d_in and d_out correspond to the number of input and output dimensions (or nodes) of the layer. The weights of a Conv2d that maps an input tensor from d_in channels to d_out channels are then sampled from the distribution given in Equation (6.3).
We also initialize the scale of BatchNorm and GroupNorm to 1 and the bias (shift) to 0. Every other layer in the network uses the default initialization provided by PyTorch.
$$W \sim \mathcal{U}\left[ -\frac{\sqrt{6}}{\sqrt{d_{in} + d_{out}}},\ \frac{\sqrt{6}}{\sqrt{d_{in} + d_{out}}} \right] \qquad (6.3)$$
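A sketch of this initialization scheme as it could be applied via `model.apply(...)` in PyTorch is shown below; it mirrors the rules stated above, not necessarily the exact code used in our experiments.

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """Apply with model.apply(init_weights)."""
    if isinstance(module, nn.Conv2d):
        nn.init.xavier_uniform_(module.weight)     # Equation (6.3)
        if module.bias is not None:
            nn.init.constant_(module.bias, 0.0)
    elif isinstance(module, (nn.BatchNorm2d, nn.GroupNorm)):
        nn.init.constant_(module.weight, 1.0)      # scale
        nn.init.constant_(module.bias, 0.0)        # shift
```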
Learning Rate Schedule
We kept our learning rate schedule similar to the one proposed in Rasp and Thuerey [2]. We use a ReduceLROnPlateau scheduler with factor = 0.5 and patience = 0, meaning that we halve the learning rate after every epoch in which the validation or test loss did not improve. The initial learning rate was chosen to be $1 \times 10^{-4}$ because it yielded good scores; learning rates smaller than $1 \times 10^{-5}$ resulted in very slow convergence of the models. We believe the learning rate could be tweaked further.
Early Stopping
We consider the model converged if the minimum of the validation or test loss does not improve for three consecutive epochs. With the learning rate schedule above, this is equivalent to three learning rate decreases in a row.
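The scheduler and the early stopping criterion together might look like the following sketch; `train_one_epoch`, `evaluate` and `max_epochs` are placeholders for the corresponding pieces of our training loop.

```python
from torch.optim.lr_scheduler import ReduceLROnPlateau

scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=0)

max_epochs = 100                                      # upper bound, rarely reached
best_val, epochs_without_improvement = float("inf"), 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    val_loss = evaluate(model, val_loader)            # hypothetical helper

    scheduler.step(val_loss)   # halves the LR whenever the loss did not improve

    if val_loss < best_val:
        best_val, epochs_without_improvement = val_loss, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= 3:            # early stopping criterion
            break
```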
Remarks (Dropout, Training Time)
Not applying dropout led to better validation scores, so we do not use it. On the described nodes we achieve a throughput of roughly one epoch per hour. The number of epochs required for a model to converge varies greatly, depending on the model's size. Our larger models usually require 30 epochs, or a minimum of 30 h per experiment.
6.4 Results
Table 6.2 shows the results on the test set for the variables z500, t850 and t2m for 3 and 5 days lead time, for several parameter variations. For reasons of space, we had to abbreviate some of the parameters; the abbreviations and their descriptions are listed in Table 6.1. Our best model compared to the WeatherBench baselines is presented in Table 6.3.
We also conducted some experiments on Lambda Networks [44], Attention Augmented Convolutions [21] and DenseNets [45] in an earlier phase of this thesis, but we did not dive deeper into those methods due to slow training and poor initial results. At that time the models only predicted t850 for 3 days lead time and got stuck at an RMSElat test score of 2.10. We could increase the training speed by using a linear attention method that combines the values and keys first (sketched below), but we observed no decrease in the test loss.
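One common formulation of such a linear attention, where the key-value contraction is computed before involving the queries, is sketched below; the exact variant and normalization we experimented with may differ.

```python
import torch
import torch.nn.functional as F

def linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Attention that contracts keys with values first, O(N * d^2) instead of O(N^2 * d).
    q, k, v: (batch, heads, N, d)."""
    q = F.softmax(q, dim=-1)                            # softmax over the feature dimension
    k = F.softmax(k, dim=-2)                            # softmax over the sequence dimension
    context = torch.einsum("bhnd,bhne->bhde", k, v)     # (batch, heads, d, d)
    return torch.einsum("bhnd,bhde->bhne", q, context)  # (batch, heads, N, d)
```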
Remark: Due to unforeseen incidents on the AULT cluster during the last weeks of this thesis, we were not able to run all models until full convergence. We would have required roughly 1.5 additional weeks to complete all our experiments.
Table 6.1: Abbreviations for parameters listed in Table 6.2
Abbreviation   Meaning
L              Size of the relative positional embedding in GSA modules
bs             Batch size
w              Use latitude weighting inside the GSA module
a              Use affine layer after the GSA module
k              Number of GSA modules
r              Number of ResNet layers used
Eps.           Number of epochs the model was trained
Params         Number of parameters of the model
Table 6.2: Experiments on WeatherBench using a simple 5-layer CNN5 [42] and GSA-Net, GSA-ResNet and ResNet model types for 3 and 5 days lead time, predicting z500, t850 and t2m.
Group         #   L    dim  bs  norm      activation  w  a  loss     k   r   z500 (3/5 d)      t850 (3/5 d)  t2m (3/5 d)  Eps.  Params
CNN5 [42]     11  -    128  64  standard  LeakyReLU   -  -  MSElat   -   -   634.59 / 757.44   2.87 / 3.31   2.79 / 3.03  29    906K
GSA-Nets      05  3    128  32  standard  LeakyReLU   ✗  ✗  MSElat   8   0   349.84 / 627.03   1.90 / 2.81   1.59 / 2.17  25    11.4M
              06  64   128  32  standard  LeakyReLU   ✗  ✗  L1lat    8   0   324.48 / 573.77   1.83 / 2.69   1.53 / 2.08  27    11.5M
              07  3    256  32  standard  LeakyReLU   ✓  ✗  MSElat   8   0   346.06 / 599.11   1.89 / 2.73   1.60 / 2.14  20    41.7M
              08  3    256  32  standard  LeakyReLU   ✓  ✗  L1lat    8   0   326.32 / 586.05   1.83 / 2.69   1.54 / 2.13  14    41.7M
GSA-ResNets   01  3    128  32  standard  LeakyReLU   ✗  ✗  MSElat   8   20  329.27 / 553.54   1.85 / 2.60   1.59 / 2.05  20    17.3M
              02  64   128  32  standard  LeakyReLU   ✗  ✗  L1lat    8   20  310.80 / 533.92   1.79 / 2.55   1.52 / 2.04  20    17.4M
              03  3    256  32  standard  LeakyReLU   ✓  ✗  MSElat   8   20  352.18 / 566.46   1.95 / 2.66   1.67 / 2.28  11    65.3M
              04  3    128  32  standard  LeakyReLU   ✓  ✗  L1lat    8   20  321.09 / 555.92   1.83 / 2.62   1.55 / 2.06  14    17.3M
              12  3    128  32  standard  LeakyReLU   ✓  ✗  MSElat   8   20  339.38 / 583.84   1.86 / 2.70   1.59 / 2.14  14    17.3M
              14  3    128  32  standard  PReLU       ✓  ✗  L1lat    8   20  359.24 / 578.92   1.95 / 2.70   1.71 / 2.28  7     17.3M
              15  3    128  32  LAS       LeakyReLU   ✓  ✗  MSElat   28  0   317.60 / 542.47   1.80 / 2.56   1.55 / 2.14  11    37.6M
              16  3    128  32  standard  LeakyReLU   ✗  ✓  MSElat   8   20  310.22 / 550.17   1.77 / 2.58   1.49 / 2.02  55    21.5M
ResNets       09  -    128  64  standard  LeakyReLU   -  -  MSElat   0   28  322.91 / 555.19   1.83 / 2.61   1.55 / 2.06  29    9.18M
              10  -    256  64  standard  LeakyReLU   -  -  L1lat    0   28  305.40 / 542.32   1.79 / 2.61   1.50 / 2.03  19    34.9M
Future        17  3    128  32  standard  LeakyReLU   ✗  ✗  L1lat    8   20  315.38 / 558.53   1.79 / 2.62   1.50 / 2.05  33    16.7M
Table 6.3: Baselines in WeatherBench and our best model, scored on the same data contained in WeatherBench using the RMSElat metric. Our best model is marked with "(ours)".
Model                                   z500 (3/5 d)      t850 (3/5 d)  t2m (3/5 d)
Persistence                             936 / 1033        4.23 / 4.56   3.00 / 3.27
Weekly Climatology                      816               3.50          6.07
IFS T42                                 489 / 743         3.09 / 3.83   3.21 / 3.69
IFS T63                                 268 / 463         1.85 / 2.52   2.04 / 2.44
Operational IFS                         154 / 334         1.36 / 2.03   1.35 / 1.77
Linformer Weather [1]                   505 / 724         2.44 / 3.16   -
UNet Weyn et al. [33]                   373 / 611         1.98 / 2.87   -
ResNet19 Continuous (ERA only) [2]      331 / 545         1.87 / 2.57   1.60 / 2.06
ResNet19 Direct (ERA only) [2]          314 / 561         1.79 / 2.82   1.53 / 2.32
GSA-ResNet #16 (ERA only) (ours)        310.22 / 550.17   1.77 / 2.58   1.49 / 2.02
ResNet19 Continuous (pretrained) [2]    284 / 499         1.72 / 2.41   1.48 / 1.92
ResNet19 Direct (pretrained) [2]        268 / 523         1.65 / 2.52   1.42 / 2.03
To keep the number of plots small and manageable in the following sections, we restrict ourselves to the candidate models shown in Table 6.4 for further investigation.
Table 6.4: Candidate models picked from Table 6.2 for further research and evaluations.
#   Model       L   dim  bs  norm      act        w  a  loss    k  r   z500 (3/5 d)      t850 (3/5 d)  t2m (3/5 d)
10  ResNet      -   128  64  standard  LeakyReLU  -  -  MSElat  0  28  305.40 / 542.32   1.79 / 2.61   1.50 / 2.03
16  GSA-ResNet  3   128  32  standard  LeakyReLU  ✗  ✓  MSElat  8  20  310.22 / 550.17   1.77 / 2.58   1.49 / 2.02
06  GSA-Net     64  128  32  standard  LeakyReLU  ✗  ✗  L1lat   8  0   324.48 / 573.77   1.83 / 2.69   1.53 / 2.08
6.4.1 Relative Positional Embeddings
This section deals with the learned relative positional embeddings of the GSA models. The following subsections provide plots of the column matrix $R_c$ and the row matrix $R_r$ for every layer. We can observe that the learned relative positional embeddings in Figures 6.1 and 6.2 are definitely not random. In both figures, almost all the embedding layers are very sparse and give weight to specific relative locations, best visible in Figure 6.2. It also has to be noted that the magnitudes of the values in the embeddings in Figure 6.1 are almost zero and most likely do not contribute to the attention.
GSA-ResNet #16
Figure 6.1: Learned relative positional embedding of size L = 3 for the model GSA-ResNet #16.
GSA-Net #06
Figure 6.2: Learned relative positional embedding of size L = 64 for the model GSA-Net #06.
6.4.2 First Indicators on Future Ideas
The list of experiments in Table 6.2 also contains one small experiment on future work. The results of experiment #17 should act as a first indicator of how well novel ideas perform. We were specifically interested in keeping the time dimension in the data grid instead of stacking it along the channel dimension, as in [2]. Hence, the model architecture differs slightly from the one presented in Chapter 5. Since these experiments are not the main focus of this thesis, we merely provide the scores and do not investigate them further. We will elaborate more on future work in Section 8.2, but we briefly go over the main differences in architecture here.
By not stacking the time steps along the channel dimension, and thus not collapsing the time dimension into the channels, the input data is of size (batch, time, vars, h, w), where vars stands for the different parameters over different pressure levels, e.g. t850. First, instead of adding a 2D fixed positional encoding to the input, we add a 3D fixed sinusoidal positional encoding computed over the time dimension. For the model to further process the input, we reshape the data to (batch, vars, h · time, w). By doing so, the 2D grid consists of consecutive blocks of rows per day, which can be interpreted as a sequence per parameter channel. Figure 6.3 provides a visualization, and a code sketch of the reshaping is given after the figure. Before applying the last layer that transforms the intermediate result to the desired number of output channels, we reshape the data back to (batch, time, vars, h, w) and remove the time dimension by taking the mean over it with torch.mean(tensor, dim=1).
Figure 6.3: Example visualization of the data after reshaping it to (batch, vars, h · time, w).
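A minimal sketch of the reshaping described above is shown below; the shapes are purely illustrative, and the 3D positional encoding added before this step is omitted.

```python
import torch

# Illustrative shapes only: batch 8, 2 time steps, 7 variables, a 32x64 grid.
x = torch.randn(8, 2, 7, 32, 64)                       # (batch, time, vars, h, w)
b, t, v, h, w = x.shape

# Stack the time steps as consecutive blocks of rows per parameter channel.
x = x.permute(0, 2, 1, 3, 4).reshape(b, v, t * h, w)   # (batch, vars, h*time, w)

# ... the model processes this enlarged 2D grid ...

# Undo the reshape and average out the time dimension before the output layer.
x = x.reshape(b, v, t, h, w).permute(0, 2, 1, 3, 4)    # (batch, time, vars, h, w)
x = torch.mean(x, dim=1)                               # (batch, vars, h, w)
```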
6.5 Extreme Weather Events
To compare the models' scores, we evaluate them using the same metric that averages the prediction performance over all forecasts. Thus, having more accurate forecasts for individual dates only mildly influences the average score. But exactly these individual dates might be of extreme importance, as they contain forecasts under extreme weather conditions. We want to elaborate on that thought and investigate whether Transformers, which only perform slightly better in the mean, perform convincingly better under extreme weather conditions.
We evaluate the candidate models on the Storm of the Century (1993), Hurricane Katrina (2005) and Cyclone Emma (2008). Each subsection provides the error for 1, 3 and 5 days lead time for the variables z500 and t2m. Additionally, we provide saliency plots, the top 20 gradients, attention heads and the Transformer's contribution to the prediction for 3 days into the future. All remaining plots can be found in Appendix A.2.
Before diving into the results for the different dates, we want to briefly explain how the plots mentioned above were generated. Most of the plots are based on the gradients computed with respect to the inputs. More specifically, let the gradients for our stacked-channel models be $W_{grad} \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ is the batch size, $C$ is the number of stacked channels and $H \times W$ is the size of the 2D values grid. Additionally, let the prediction be $\mathcal{F}(X; \Theta)$ and the input tensor be $X$. Since we only evaluate the model on one specific date, our batch size is $B = 1$. One can then obtain the gradients by setting requires_grad_() on the input tensor in PyTorch and running one backward pass after computing the loss function.
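A minimal sketch of this procedure is given below; `input_tensor`, `model`, `loss_fn` and `target` are placeholders, and the aggregation into a saliency map shown in the last line (maximum absolute gradient over channels) is one common choice, not necessarily the exact one used for our plots.

```python
import torch

x = input_tensor.clone().requires_grad_(True)   # input X, shape (1, C, H, W)
pred = model(x)                                 # forward pass F(X; Theta)
loss = loss_fn(pred, target)                    # e.g. the latitude-weighted MSE from Section 6.3
loss.backward()                                 # one backward pass populates x.grad

w_grad = x.grad                                 # gradients W_grad of shape (B, C, H, W), here B = 1
saliency = w_grad.abs().amax(dim=1)             # one possible per-pixel saliency aggregation
```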