Loss for Mode Connecting and Fast Ensembling

Gregory W. Benton 1 Wesley J. Maddox 1 Sanae Lotfi 1 Andrew Gordon Wilson 1

Abstract

AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeiF48V7Ae0oWy2m3bpZhN3J0IJ/RNePCji1b/jzX/jts1BWx8MPN6bYWZekEhh0HW/ncLa+sbmVnG7tLO7t39QPjxqmTjVjDdZLGPdCajhUijeRIGSdxLNaRRI3g7GtzO//cS1EbF6wEnC/YgOlQgFo2ilTg9HHGnf65crbtWdg6wSLycVyNHol796g5ilEVfIJDWm67kJ+hnVKJjk01IvNTyhbEyHvGupohE3fja/d0rOrDIgYaxtKSRz9fdERiNjJlFgOyOKI7PszcT/vG6K4bWfCZWkyBVbLApTSTAms+fJQGjOUE4soUwLeythI6opQxtRyYbgLb+8Slq1qndRrd1fVuo3eRxFOIFTOAcPrqAOd9CAJjCQ8Ayv8OY8Oi/Ou/OxaC04+cwx/IHz+QPOtY/Q 1 ✓AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeiF48V7Ae0oWy2m3bpZhN3J0IJ/RNePCji1b/jzX/jts1BWx8MPN6bYWZekEhh0HW/ncLa+sbmVnG7tLO7t39QPjxqmTjVjDdZLGPdCajhUijeRIGSdxLNaRRI3g7GtzO//cS1EbF6wEnC/YgOlQgFo2ilTg9HHGm/1i9X3Ko7B1klXk4qkKPRL3/1BjFLI66QSWpM13MT9DOqUTDJp6VeanhC2ZgOeddSRSNu/Gx+75ScWWVAwljbUkjm6u+JjEbGTKLAdkYUR2bZm4n/ed0Uw2s/EypJkSu2WBSmkmBMZs+TgdCcoZxYQpkW9lbCRlRThjaikg3BW355lbRqVe+iWru/rNRv8jiKcAKncA4eXEEd7qABTWAg4Rle4c15dF6cd+dj0Vpw8plj+APn8wfQOY/R 2 ✓AAAB73icbVBNS8NAEN34WetX1aOXxSJ4Kkkr6LHoxWMF+wFtKJvtpF262cTdiVBC/4QXD4p49e9489+4bXPQ1gcDj/dmmJkXJFIYdN1vZ219Y3Nru7BT3N3bPzgsHR23TJxqDk0ey1h3AmZACgVNFCihk2hgUSChHYxvZ377CbQRsXrASQJ+xIZKhIIztFKnhyNA1q/1S2W34s5BV4mXkzLJ0eiXvnqDmKcRKOSSGdP13AT9jGkUXMK02EsNJIyP2RC6lioWgfGz+b1Tem6VAQ1jbUshnau/JzIWGTOJAtsZMRyZZW8m/ud1Uwyv/UyoJEVQfLEoTCXFmM6epwOhgaOcWMK4FvZWykdMM442oqINwVt+eZW0qhWvVqneX5brN3kcBXJKzsgF8cgVqZM70iBNwokkz+SVvDmPzovz7nwsWtecfOaE/IHz+QPRvY/S 3 ✓AAAB73icbVBNS8NAEN34WetX1aOXxSJ4Kkkt6LHoxWMF+wFtKJvtpF262cTdiVBC/4QXD4p49e9489+4bXPQ1gcDj/dmmJkXJFIYdN1vZ219Y3Nru7BT3N3bPzgsHR23TJxqDk0ey1h3AmZACgVNFCihk2hgUSChHYxvZ377CbQRsXrASQJ+xIZKhIIztFKnhyNA1q/1S2W34s5BV4mXkzLJ0eiXvnqDmKcRKOSSGdP13AT9jGkUXMK02EsNJIyP2RC6lioWgfGz+b1Tem6VAQ1jbUshnau/JzIWGTOJAtsZMRyZZW8m/ud1Uwyv/UyoJEVQfLEoTCXFmM6epwOhgaOcWMK4FvZWykdMM442oqINwVt+eZW0qhXvslK9r5XrN3kcBXJKzsgF8cgVqZM70iBNwokkz+SVvDmPzovz7nwsWtecfOaE/IHz+QPTQY/T 4 ✓AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeiF48V7Ae0oWy2m3bpZhN3J0IJ/RNePCji1b/jzX/jts1BWx8MPN6bYWZekEhh0HW/ncLa+sbmVnG7tLO7t39QPjxqmTjVjDdZLGPdCajhUijeRIGSdxLNaRRI3g7GtzO//cS1EbF6wEnC/YgOlQgFo2ilTg9HHGnf65crbtWdg6wSLycVyNHol796g5ilEVfIJDWm67kJ+hnVKJjk01IvNTyhbEyHvGupohE3fja/d0rOrDIgYaxtKSRz9fdERiNjJlFgOyOKI7PszcT/vG6K4bWfCZWkyBVbLApTSTAms+fJQGjOUE4soUwLeythI6opQxtRyYbgLb+8Slq1qndRrd1fVuo3eRxFOIFTOAcPrqAOd9CAJjCQ8Ayv8OY8Oi/Ou/OxaC04+cwx/IHz+QPOtY/Q 1 ✓AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeiF48V7Ae0oWy2m3bpZhN3J0IJ/RNePCji1b/jzX/jts1BWx8MPN6bYWZekEhh0HW/ncLa+sbmVnG7tLO7t39QPjxqmTjVjDdZLGPdCajhUijeRIGSdxLNaRRI3g7GtzO//cS1EbF6wEnC/YgOlQgFo2ilTg9HHGm/1i9X3Ko7B1klXk4qkKPRL3/1BjFLI66QSWpM13MT9DOqUTDJp6VeanhC2ZgOeddSRSNu/Gx+75ScWWVAwljbUkjm6u+JjEbGTKLAdkYUR2bZm4n/ed0Uw2s/EypJkSu2WBSmkmBMZs+TgdCcoZxYQpkW9lbCRlRThjaikg3BW355lbRqVe+iWru/rNRv8jiKcAKncA4eXEEd7qABTWAg4Rle4c15dF6cd+dj0Vpw8plj+APn8wfQOY/R 2 ✓AAAB73icbVBNS8NAEN34WetX1aOXxSJ4Kkkr6LHoxWMF+wFtKJvtpF262cTdiVBC/4QXD4p49e9489+4bXPQ1gcDj/dmmJkXJFIYdN1vZ219Y3Nru7BT3N3bPzgsHR23TJxqDk0ey1h3AmZACgVNFCihk2hgUSChHYxvZ377CbQRsXrASQJ+xIZKhIIztFKnhyNA1q/1S2W34s5BV4mXkzLJ0eiXvnqDmKcRKOSSGdP13AT9jGkUXMK02EsNJIyP2RC6lioWgfGz+b1Tem6VAQ1jbUshnau/JzIWGTOJAtsZMRyZZW8m/ud1Uwyv/UyoJEVQfLEoTCXFmM6epwOhgaOcWMK4FvZWykdMM442oqINwVt+eZW0qhWvVqneX5brN3kcBXJKzsgF8cgVqZM70iBNwokkz+SVvDmPzovz7nwsWtecfOaE/IHz+QPRvY/S 3 ✓AAAB73icbVBNS8NAEN34WetX1aOXxSJ4Kkkt6LHoxWMF+wFtKJvtpF262cTdiVBC/4QXD4p49e9489+4bXPQ1gcDj/dmmJkXJFIYdN1vZ219Y3Nru7BT3N3bPzgsHR23TJxqDk0ey1h3AmZACgVNFCihk2hgUSChHYxvZ377CbQRsXrASQJ+xIZKhIIztFKnhyNA1q/1S2W34s5BV4mXkzLJ0eiXvnqDmKcRKOSSGdP13AT9jGkUXMK02EsNJIyP2RC6lioWgfGz+b1Tem6VAQ1jbUshnau/JzIWGTOJAtsZMRyZZW8m/ud1Uwyv/UyoJEVQfLEoTCXFmM6epwOhgaOcWMK4FvZWykdMM442oqINwVt+eZW0qhXvslK9r5XrN3kcBXJKzsgF8cgVqZM70iBNwokkz+SVvDmPzovz7nwsWtecfOaE/IHz+QPTQY/T 4 ✓AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeiF48V7Ae0oWy2m3bpZhN3J0IJ/RNePCji1b/jzX/jts1BWx8MPN6bYWZekEhh0HW/ncLa+sbmVnG7tLO7t39QPjxqmTjVjDdZLGPdCajhUijeRIGSdxLNaRRI3g7GtzO//cS1EbF6wEnC/YgOlQgFo2ilTg9HHGnf65crbtWdg6wSLycVyNHol796g5ilEVfIJDWm67kJ+hnVKJjk01IvNTyhbEyHvGupohE3fja/d0rOrDIgYaxtKSRz9fdERiNjJlFgOyOKI7PszcT/vG6K4bWfCZWkyBVbLApTSTAms+fJQGjOUE4soUwLeythI6opQxtRyYbgLb+8Slq1qndRrd1fVuo3eRxFOIFTOAcPrqAOd9CAJjCQ8Ayv8OY8Oi/Ou/OxaC04+cwx/IHz+QPOtY/Q 1 ✓AAAB73icbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeiF48V7Ae0oWy2m3bpZhN3J With a better understanding of the loss surfaces for multilayer networks, we can build more robust and accurate training procedures. Recently it was discovered that independently trained SGD so- lutions can be connected along one-dimensional paths of near-constant training loss. In this pa- per, we show that there are mode-connecting sim- Figure 1. A progressive understanding of the loss surfaces of neu- plicial complexes that form multi-dimensional ral networks. Left: The traditional view of loss in parameter space, of low loss, connecting many indepen- in which regions of low loss are disconnected (Goodfellow et al., 2014; Choromanska et al., 2015). Center: The revised view of dently trained models. Inspired by this discov- loss surfaces provided by work on mode connectivity; multiple ery, we show how to efficiently build simplicial SGD training solutions are connected by narrow tunnels of low complexes for fast ensembling, outperforming loss (Garipov et al., 2018; Draxler et al., 2018; Fort & Jastrzeb- independently trained deep ensembles in accu- ski, 2019). Right: The viewpoint introduced in this work; SGD racy, calibration, and robustness to dataset shift. training converges to different points on a connected of Notably, our approach only requires a few train- low loss. We show that paths between different training solutions ing epochs to discover a low-loss , start- exist within a large multi-dimensional of low loss. We ing from a pre-trained solution. Code is avail- provide a two dimensional representation of these loss surfaces in able at https://github.com/g-benton/ Figure A.1. loss-surface-simplexes.

isolated low loss modes that can be found through training 1. Introduction randomly initialized networks. In the center we have a Despite significant progress in the last few years, little is more contemporary view, showing that there are paths that known about neural network loss landscapes. Recent works connect these modes. On the right we present a new view — have shown that the modes found through SGD training that all modes found through standard training converge to multi-dimensional of randomly initialized models are connected along narrow points within a single connected volume pathways connecting pairs of modes, or through tunnels of low loss. connecting multiple modes at once (Garipov et al., 2018; We introduce Simplicial Pointwise Random Optimization Draxler et al., 2018; Fort & Jastrzebski, 2019). In this (SPRO) as a method of finding simplexes and simplicial paper we show that there are in fact large multi-dimensional complexes that bound volumes of low loss in parame- arXiv:2102.13042v1 [cs.LG] 25 Feb 2021 simplicial complexes of low loss in the parameter space of ter space. With SPRO we are able to find mode con- neural networks that contain arbitrarily many independently necting spaces that simultaneously connect many indepen- trained modes. dently trained models through a a single well defined multi- The ability to find these large volumes of low loss that dimensional manifold. Furthermore, SPRO is able to explic- can connect any number of independent training solutions itly define a space of low loss solutions through determining represents a natural progression in how we understand the the bounding vertices of the , meaning loss landscapes of neural networks, as shown in Figure1. that computing the dimensionality and volume of the space In the left of Figure1, we see the classical view of loss become straightforward, as does sampling models within surface structure in neural networks, where there are many the complex. This enhanced understanding of loss surface structure en- 1New York University. Correspondence to: Gregory W. Benton . ables practical methodological advances. Through the abil- ity to rapidly sample models from within the simplex we can form Ensembled SPRO (ESPRO) models. ESPRO works Loss Surface Simplexes by generating a simplicial complex around independently Most closely related to our work, Fort & Jastrzebski(2019) trained models and ensembling from within the simplexes, propose viewing the loss landscape as a series of potentially outperforming the gold standard deep ensemble combina- connected low-dimensional wedges in the much higher di- tion of independently trained models (Lakshminarayanan mensional parameter space. They then demonstrate that et al., 2017). We can view this ensemble as an approxima- sets of optima can be connected via low-loss connectors tion to a Bayesian model average, where the posterior is that are generalizations of Garipov et al.(2018)’s proce- uniformly distributed over a simplicial complex. dure. Our work generalizes these findings by discovering higher dimensional mode connecting volumes, which we Our paper is structured as follows: In Section3, we intro- then leverage for a highly efficient and practical ensembling duce a method to discover multi-dimensional mode con- procedure. necting simplexes in the neural network loss surface. In Section4, we show the existence of mode connecting volumes and provide a lower bound on the dimensional- 3. Mode Connecting Volumes ity of these volumes. Building on these insights, in Sec- We now show how to generalize the procedure of Garipov tion5 we introduce ESPRO, a state-of-the-art approach et al.(2018) to discover simplices of mode connecting vol- to ensembling with neural networks, which efficiently av- umes, containing infinitely many mode connecting curves. erages over simplexes. In Section6, we show that ES- In Section3, we then show how to use our procedure to PRO also provides well-calibrated representations of uncer- demonstrate the existence of these volumes in modern neu- tainty. We emphasize that ESPRO can be used as a sim- ral networks, revising our understanding about the structure ple drop-in replacement for deep ensembles, with improve- of their loss landscapes. In Sections5 and6 we show how to ments in accuracy and uncertainty representations. Code we can use these discoveries to build practical new methods is available at https://github.com/g-benton/ which provide state of the art performance for both accuracy loss-surface-simplexes. and uncertainty representation. We refer to our approach as SPRO (Simplicial Pointwise Random Optimization). 2. Related Work The study of neural network loss surfaces has long been 3.1. Simplicial Complexes of Low Loss intertwined with an understanding of neural network gen- To find mode connecting volumes we seek simplexes and eralization. Hochreiter & Schmidhuber(1997) argued that simplicial complexes of low loss. Two primary reasons we flat minima provide better generalization, and proposed an seek simplexes of low loss are that (i) simplexes are defined optimization algorithm to find such solutions. Keskar et al. by only a few points, and (ii) simplexes are easily sam- (2017) and Li et al.(2018) reinvigorated this argument by pled. The first means that to define a mode connecting visualizing loss surfaces and studying the geometric proper- simplicial complex of low loss we need only find a small ties of deep neural networks at their minima. Izmailov et al. number of vertices to fully determine the simplexes in the (2018) found that averaging SGD iterates with a modified complex. The second point means that we have easy access learning rate finds flatter solutions that generalize better. to the models contained within the simplex, leading to the Maddox et al.(2019) leveraged these insights in the context practical simplex-based ensembling methods presented later of Bayesian deep learning to form posteriors in flat regions in the paper. of the loss landscape. Moreover, Maddox et al.(2020) found many directions in parameter space that can be perturbed We consider data D, and training objective L. We refer to S as the k-simplex formed by ver- without changing the training or test loss. (a0,a1,...,ak) tices a0, a1, . . . , ak, and V(S(a0,...,ak)) as the volume Freeman & Bruna(2017) demonstrated that single layer of the simplex.1 Simplicial complexes are denoted ReLU neural networks can be connected along a low loss K(S ,S ,...,S ), (a0,a1,...,aNa ) (b0,b1,...,bNb ) (m0,m1,...,mNm ) curves. Draxler et al.(2018) and Garipov et al.(2018) simul- and their volume is computed as the sum of the volume of taneously demonstrated that it is possible to find low loss their components. We use wj to denote modes, or SGD curves for ResNets and other deep networks. Skorokhodov training solutions, and θj to denote mode connecting points. & Burtsev(2019) used multi-point optimization to parame- For example, we could train two independent models to find terize wider varieties of in loss surfaces, including parameter settings w0 and w1, and then find mode connect- exotic shapes such as cows. Czarnecki et al.(2019) then ing point θ0 such that the path w0 → θ0 → w1 traversed proved theoretically that low dimensional spaces of nearly low loss parameter settings as in Fort & Jastrzebski(2019) constant loss do exist in the loss surfaces of deep ReLU and Garipov et al.(2018). networks, but did not provide an algorithm to find these loss 1 surfaces. We use Cayley-Menger to compute the volume of simplexes, for more information see Appendix A.1. Loss Surface Simplexes

3.2. Simplicial Complexes With SPRO In general only a small amount of regularization is needed, and results are not sensitive to the choice of λ . In Section To find a simplicial complex of low loss solutions, we first j 5 we explain how to adapt Eq.1 to train simplexes of low find a collection of modes through standard w0, . . . , wk loss using single independetly trained models.... We provide training. This procedure gives the trivial simplicial com- details about how we choose λj in Appendix A.2. plex K(S(w0),...,S(wk)) (or K), a complex containing k disjoint 0-simplexes. With these modes we can then itera- tively add connecting points, θj, to join any number of the 4. Volume Finding Experiments 0-simplexes in the complex, and train the parameters in θ j In this section, we find volumes of low loss in a variety of such that the loss within the simplicial complex, K, remains settings. First, we show that the mode finding procedure of low. The procedure to train these connecting θ forms the j Garipov et al.(2018) can be extended to find distributions core of the SPRO algorithm, given here. of modes. Then, we explore mode connecting simplicial To gain intuition, we first consider some examples be- complexes of low loss in a variety of settings, and finally fore presenting the full SPRO training procedure. As we provide an empirical upper bound on the dimensionality of have discussed, we can take modes w0 and w1 and train the mode connecting spaces.

θ0 to find a complex K(S(w0,θ0),S(w1,θ0)), which recov- ers a mode connecting path as in Garipov et al.(2018). Loss Surface Plots. Throughout this section and the re- Alternatively, we could connect θ0 with more than two mainder of the paper we display two-dimensional visual- modes and build the complex K(S(w0,θ0),...,S(w4,θ0)), izations of loss surfaces of neural networks. These plots connecting 5 modes through a single point, similar to represent the loss within the plane defined by the three the m-tunnels presented in Fort & Jastrzebski(2019). points (representing parameter vectors) in each plot. More SPRO can be taken further, however, and we could train specifically, if the three points in question are, e.g., w0, w1, (one at a time) a sequence of θ ’s to find the complex 1 P2 j and w2 then we define c = 3 i=0 wi as the center of the K(S(w0,θ0,θ1,θ2),S(w1,θ0,θ1,θ2),S(w2,θ0,θ1,θ2)), describing a points and use Gram-Schmidt to construct u and v, an or- multi-dimensional volume that simultaneously connects 3 thonormal for the plane defined by the points. With modes through 3 shared points. the center and the basis chosen, we can sample the loss at parameter vectors of the form = + + where We aim to train the θ ’s in K such that the expected loss for w c ruu rvv ru j − models in the simplicial complex is low and the volume of and rv range from R to R, a range parameter chosen such the simplicial complex is as large as possible. That is, as that all the points are within the surface with an reasonable th . we train the j connecting point, θj, we wish to minimize Eφ∼KL(D, φ) while maximizing V(K), using φ ∼ K to indicate φ follows a uniform distribution over the simplicial 4.1. Volumes of Connecting Modes complex K. In Bayesian deep learning, we wish to form a predictive Following Garipov et al.(2018), we use M parameter distribution through a posterior weighted Bayesian model H average: vectors randomly sampled from the simplex, φh=1 ∼ 1 PH K, to compute H h=1 L(D, φh) as an estimate of Z 2 Eφ∼KL(D, φ). In practice we only need a small num- p(y|x, D) = p(y|w, x)p(w|D)dw , (2) ber of samples, H, and for all experiments use H = 5 to balance between avoiding significant slowdowns in the loss where y is an output (e.g., a class label), x is an input (e.g., and ensuring we have reasonable estimates of the an image), D is the data, and w are the neural network loss over the simplex. Using this estimate we train θj by weights. This integral is challenging to compute due to the minimizing the regularized loss, complex structure of the posterior p(w|D). 1 X To help address this challenge, we can instead approximate Lreg(K) = L(D, φh) − λj log(V(K)). H (1) the Bayesian model average in a subspace that contains φ ∼K h many good solutions, as in Izmailov et al.(2019). Here, we The regularization penalty λj balances the objective be- generalize the mode connecting procedure of Garipov et al. tween seeking a smaller simplicial complex that contains (2018) to perform inference over subspaces that contain strictly low loss parameter settings (small λj), and a larger volumes of mode connecting curves. complex that that may contain less accurate solutions but In Garipov et al.(2018), a mode connecting curve is de- encompasses more volume in parameter space (large λ ). j fined by its parameters θ. Treating the objective used to 2We discuss the exact method for sampling, and the implica- find θ in Garipov et al.(2018), l(θ), as a likelihood, we tions on bias in the loss estimate in Appendix A.1. infer an approximate Gaussian posterior q(θ|D) using the Loss Surface Simplexes

1.05 SWAG procedure of Maddox et al.(2019), which induces 0 a distribution over mode connecting curves. Each sample 2 0.90 from q(θ|D) provides a mode connecting curve, which itself

contains a space of complementary solutions. 0.75 Cross Entropy Loss w w w0 1 In Figure2, we see that it is possible to move between w0 1 0.60 different values of θ without leaving a region of low loss. We show samples from the SWAG posterior, projected into 0.45 2 the plane formed by the endpoints of the curves, w0 and 2 0.30 w1, and a mode connecting point θ0. We show the induced connecting paths from SWAG samples with orange lines. w1 0 0 1 0.15 All samples from the SWAG posterior lie in the region of low loss, as do the sampled connecting paths, indicating that 0.00 there is indeed an entire volume of connected low loss solu- tions induced by the SWAG posterior over θ. We provide Figure 3. Loss surfaces for planes intersecting a mode connecting training details in the Appendix A.4. simplicial complex K(S(w0,θ0,θ1,θ2),S(w1,θ0,θ1,θ2)) trained on CIFAR-10 using a VGG16 network. Top: along any w0 → θj → 9.6 w1 path we recover a standard mode connecting path. Bottom 0 8.4 Left: a of one of the simplexes that contains one of the 7.2 independently trained modes. We see that as we travel away from 6.0 w1 along any path within the simplex we retain low train loss. 4.8 Bottom Right: the simplex defined by the three mode connecting 3.6 points. Any point sampled from within this simplex defines a w0 w1 Cross Entropy Loss 2.4 low-loss mode connecting path between w0 and w1. 1.2 SWAG Samples 0.0

Figure 2. A loss surface in the basis spanned by the defining points of a connecting curve, w0, w1, θ0. Using SWAG, we form a pos- terior distribution over mode connecting curves, representing a volume of low loss explanations for the data. all training solutions connected by paths of low loss, they are points on the same multi-dimensional manifold of low 4.2. Simplicial Complex Mode Connectivity loss. In the bottom right panel of Figure3, every point in the simplex corresponds to a different mode connecting curve. The results of Section 4.1 suggest that modes might be con- nected by multi-dimensional paths. SPRO represents a natu- In Figure4, we show there exist manifolds of low loss ral generalization of the idea of learning a distribution over that are vastly more intricate and high dimensional than a connecting paths. By construction, if we use SPRO to find simple composition of 3-simplexes connecting two modes. the simplicial complex K(S(w0,θ0,...,θk),...,S(wm,θ0,...,θk)) In Figure 4a, we connect 4 modes using 3 connecting points we have found a whole space of suitable vertices to connect so that we have four different simplexes formed between the modes w0, ··· , wm. Any θ sampled from the k-simplex the modes of low loss for VGG16 networks (Simonyan &

S(θ0,...,θk) will induce a low-loss connecting path between Zisserman, 2014) on CIFAR-100. The structure becomes any two vertices in the complex. considerably more intricate as we expand the amount of modes used; Figure 4b uses 7 modes with 9 connecting To demonstrate that SPRO finds volumes of low points, forming 12 inter-connected simplexes. Note that loss, we trained a simplicial complex using SPRO, in this case not all modes are in shared simplexes with all K(S ,S ), forming two simplexes (w0,θ0,θ1,θ2) (w1,θ0,θ1,θ2) connecting points. These results clearly demonstrate that containing three connecting vertices θ , θ , θ between the 0 1 2 SPRO is capable of finding intricate and multi-dimensional two fixed points, w and w , which are pre-trained models. 0 1 structure within the loss surface. As a broader takeaway, Figure3 shows loss surface visualizations of this simplicial any mode we find through standard training is a single complex in the parameter space of a VGG16 network trained point within a large and high dimensional structure of loss, on CIFAR-10. We see that this complex contains not only as shown in the rightmost representation in Figure1. We standard mode connecting paths, but also volumes of low consider the accuracy of ensembles found via these mode loss that connect modes. Figure3 is a straightforward repre- connecting simplexes in Appendix B.3. In Section 5.4 we sentation of how the loss landscape of large neural networks consider a particularly practical approach to ensembling should be understood as suggested in Figure1; not only are with SPRO. Loss Surface Simplexes

2.00

1 1.75 2

1.50 w Cross Entropy Loss w0 0 0 0 1.25

1.00

2 2 0.75

0.50 (a) 4 modes, 3 connectors. (b) 7 modes, 9 connectors. w0 1 0 1 0.25 Figure 4. (a,b) Three dimensional projections of mode connecting 0.00 simplicial complexes with training modes shown in blue and con- nectors in orange. Blue shaded regions represent regions of low loss found via SPRO. (a) 4 modes and 3 connecting points found Figure 6. Loss surface visualizations of the faces of a sample ES- with a VGG16 network on CIFAR-100. (b) 7 modes and a total of PRO 3-simplex for a VGG network trained on CIFAR-100. The 9 connecting points found with a VGG16 network on CIFAR-10. ability to find a low-loss simplex starting from only a single SGD solution, w0, leads to an efficient ensembling procedure.

* 104 0.0 0.1 1 10 solutions for this architecture and dataset is at least 10, as Volume 10 2 adding an eleventh point collapses the volume. Simplicial Complex 2 4 6 8 10 12 Number of Connecting Points 5. ESPRO: Ensembling with SPRO

Figure 5. Volume of the simplicial complex as a function of the The ability to find large regions of low loss solutions has sig- number of connectors for a VGG net on CIFAR-10 for two settings nificant practical implications: we show how to use SPRO to λ of SPRO regularization. After 10 connectors, the volume col- efficiently create ensembles of models either within a single lapses, indicating that new points added to the simplicial complex simplex or by connecting an entire simplicial complex. We are within the span of previously found vertices. The low-loss start by generalizing the methodology presented in Section 10 manifold must be at least in this instance. 3.2, leading to a simplex based ensembling procedure, we call ESPRO (Ensembling SPRO). Crucially, our approach 4.3. Dimensionality of Loss Valleys finds a low-loss simplex starting from only a single SGD solution. We show that the different parameters in these We can estimate the highest dimensionality of the connect- simplexes gives rise to a diverse set of functions, which is ing space that SPRO can find, which provides a lower bound crucial for ensembling performance. Finally, we demon- on the true dimensionality of these mode connecting sub- strate that ESPRO outperforms state-of-the-art deep ensem- spaces for a given architecture and dataset. To measure bles (Lakshminarayanan et al., 2017), both as a function of dimensionality, we take two pre-trained modes, w0 and ensemble components and total computational budget. In w1, and construct a connecting simplex with as many con- Section6, we show ESPRO also provides state-of-the-art necting points as possible, by finding the largest k such results for uncertainty representation. that K(S(w0,θ0,...,θk),S(w1,θ0,...,θk)) contains both low loss parameter settings and has non-zero volume. We could 5.1. Finding Simplexes from a Single Mode continue adding more degenerate points to the simplex; however, the resulting simplicial complex has no volume. In Section 3.2 we were concerned with finding a simplicial complex that connects multiple modes. We now describe Figure5 shows the volume of a simplicial complex connect- how to adapt SPRO into a practical approach to ensembling ing two modes as a function of the number of connecting by instead finding multiple simplexes of low loss, each — points, k, for a VGG16 network on CIFAR-10. To ensure crucially — starting from a single pre-trained SGD solution. these are indeed low-loss complexes, we sample 25 models from each of these simplicial complexes and find that all Simplexes contain a single mode, and take the form th sampled models achieve greater than 98% accuracy on the S(wj ,θj,0,...,θj,k) where the θj,k is the k found with train set. We can continue adding new modes until we reach SPRO in a simplex where one of the vertices is mode wj. k = 11, when the volume collapses to approximately 10−4, We find SPRO simplexes one at a time, rather than as a from a maximum of 105. Thus the manifold of low loss complex. The associated loss function to find the kth vertex Loss Surface Simplexes in association with mode wj is 1 X Lreg(D,S ) = L(D, φh)− (wj ,θj,0,...,θj,k) H φh∼S (3)

λi log(V(S(wj ,θj,0,...,θj,k))).

For compactness we write φh ∼ S to indicate φh is sampled uniformly at random from simplex S(wj ,θj,0,...,θj,k). We can think of this training procedure as extending out Figure 7. Functional diversity within a simplex. We show the from the pre-trained mode wj. First, in finding θj,0 we decision boundaries for two classes, in the two spirals problem, find a of low loss solutions, where one end with predictions in yellow and purple respectively. Both plots of the line is wj. Next, with θj,0 fixed, we seek θj,1 such are independent solution samples drawn from a 3-simplex of an 8-layer feed forward classifier and demonstrate that the simplexes that the formed by wj, θj,0, θj,1 contains low loss solutions. We can continue adding vertices, constructing have considerable functional diversity, as illustrated by different many dimensional simplexes. decision boundaries. Significant differences are visible inside the data distribution (center of plots) and outside (around the edges).

With the resulting simplex S(wj ,θj,0,...,θj,k), we can sample as many models from within the simplex as we need, and use them to form an ensemble. Functionally, ensembles simplicial complex) to approximate a multimodal posterior, sampled from SPRO form an approximation to Bayesian towards a more accurate Bayesian model average. This ob- marginalization over the model parameters where we as- servation is similar to how Wilson & Izmailov(2020) show sume a posterior that is uniform over the simplex. We can that deep ensembles provide a compelling approximation define our prediction for a given input x as, to a Bayesian model average (BMA), and improve about 1 X Z deep ensembles through the MultiSWAG procedure, which yˆ = f(x, φm) ≈ f(x, φh)dφh, (4) uses a mixture of Gaussians approximation to the posterior. M φm∈S φm∼S ESPRO further improves the approximation to the BMA, by covering a larger region of the posterior corresponding where we write S as shorthand for S(wj ,θj,0,...,θj,k). Specif- ically, the Bayesian model average and its approximation to low loss solutions with functional variability. This per- using approximate posteriors is spective helps explain why ESPRO improves both accuracy and calibration, through a richer representation of epistemic Z Z uncertainty. p(y∗|y, M) = p(y∗|φ)p(φ|y)dφ ≈ p(y∗|φ)q(φ|y)dφ We verify the ability of ESPRO to find a simplex of low loss M 1 X ∗ starting from a single mode in Figure6, which shows the ≈ p(y |φi); φi ∼ q(φ|y) M loss surface in the planes defined by the faces of a 3-simplex i=1 found in the parameter space of a VGG16 network trained 5.2. ESPRO: Ensembling over Multiple Independent on CIFAR-100. The ability to find these simplexes is core to Simplexes forming ESPRO ensembles, as they only take a small num- ber of epochs to find, typically less than 10% the cost of We can significantly improve performance by ensembling training a model from scratch, and they contain diverse solu- from a simplicial complex containing multiple disjoint sim- tions that can be ensembled to improve model performance. plexes, which we refer to as ESPRO (Ensembling over Notably, we can sweep out a volume of low loss in param- SPRO simplexes). To form such an ensemble, we take a col- eter space without needing to first find multiple modes, in lection of j parameter vectors from independently trained contrast to prior work on mode connectivity (Draxler et al., models, w0, . . . , wj, and train a k + 1-order simplex at 2018; Garipov et al., 2018; Fort & Jastrzebski, 2019). We each one using ESPRO. This procedure defines the sim- show additional results with image transformers (Dosovit- plicial complex K(S(w0,...,θ0,k),...,S(wj ,...,θj,k)), which is skiy et al., 2020) on CIFAR-100 in Appendix B.2. composed of j disjoint simplexes in parameter space. Pre- dictions with ESPRO are generated as, 5.3. SPRO and Functional Diversity Z 1 X In practice we want to incorporate as many diverse high yˆ = f(x, φj) ≈ f(x, φj)dφj (5) J K accuracy classifiers as possible when making predictions to φj ∼K gain the benefits of ensembling, such as improved accuracy where K is shorthand for K(S(w0,...,θ0,k),...,S(wj ,...,θj,k)). and calibration. SPRO gives us a way to sample diverse ESPRO can be considered a mixture of simplexes (e.g. a models in parameter space, and in this section we show, Loss Surface Simplexes using a simple 2D dataset, that the parameter diversity found models is generally inexpensive; the models in Figure8 are with SPRO is a reasonable proxy for the functional diversity trained on CIFAR-10 for 200 epochs and CIFAR-100 for we actually seek. 300 epochs. Adding a vertex takes only an additional 10 epochs of training on CIFAR-10, and 20 epochs of training To better understand how the simplexes interact with the on CIFAR-100. We show the CIFAR-100 time-accuracy functional form of the model, we consider an illustrative tradeoff in AppendixB finding a similar trend to CIFAR- 10. example on the two-spirals classification dataset presented in Huang et al.(2019), in which predictions can be easily visualized. We find a 3-simplex (a ) in the param- 6. Uncertainty and Robustness eter space of a simple 8 layer deep feed forward classifier, We finish by investigating the uncertainty representation and visualize the functional form of the model for both sam- and robustness to dataset shift provided by ESPRO. We ples taken from within the simplex in parameter space. By show qualitative results on a regression problem, before examining the functional form of models sampled from sim- closing with a comparison on corruptions of CIFAR-10, plexes in parameter space we can quickly see why ESPRO comparing to deep ensembles, MultiSWA, and the state-of- is beneficial. Figure7 shows individual models sampled the-art Bayesian approach MultiSWAG (Wilson & Izmailov, from a single 3-simplex in parameter space, corresponding 2020). to clear functional diversity. Models within the simplex all fit the training data nearly perfectly but do so in distinct ways, such that we can improve our final predictions by 6.1. Qualitative Regression Experiments averaging over these models. In general, we want uncertainty to grow as we move outside of the data, to have good performance under dataset shift. 5.4. Performance of Simplicial Complex Ensembles Visualizing the growth in uncertainty is most straightforward in simple one-dimensional regression problems. Section 5.3 shows that we are able to discover simplexes in parameter space containing models that lead to diverse Izmailov et al.(2019) visualize one dimensional regres- predictions, meaning that we can ensemble within a sim- sion uncertainty by randomly initializing a two layer neural plex and gain some of the benefits seen by deep ensembles network, evaluating the neural network on three disjoint (Lakshminarayanan et al., 2017). We use SPRO to train sim- random inputs in one : (−7, −5), (−1, 1), and plicial complexes containing a number of disjoint simplexes, (5, 7), and adding noise of σ2 = 0.1 to the net’s outputs. and ensemble over these complexes to form predictions, us- The task is to recover the true noiseless function, f, given ing Eq.5. For this experiment we fix the number of samples another randomly initialized two layer network, as well as taken from the ESPRO ensemble, J, to 25 which provides to achieve reasonable confidence bands in the regions of the best trade off of accuracy vs test time compute cost.3 missing data — we used a Gaussian likelihood with fixed σ2 = 0.1 to train the networks, modelling the noisy data For example, if we are training a deep ensemble with 3 y. In Figure9, we show the results of ESPRO (top left) ensemble components on CIFAR-10, we can form a deep which recovers good qualitative uncertainty bands on this ensemble to achieve an error rate of approximately 6.2%; task. We compare to deep ensembles (size 5) (top right) however, by extending each base model to just a simple and the state of the art subspace inference method of Iz- 2-simplex (3 vertices) we can achieve an error rate of ap- mailov et al.(2019) (bottom left), finding that ESPRO does proximately 5.7% — an improvement of nearly 10%! a better job of recovering uncertainty about the latent func- After finding a mode through standard training, a low order tion f than either competing method, as shown by the 2σ simplex can be found in just a small fraction of the time confidence region about p(f|D). Indeed, even taking into it takes to train a model from scratch. For a fixed training account the true noise, ESPRO complexes also do a bet- budget, we find that we can achieve a much lower error rate ter job of modelling the noisy responses, y, measured by through training fewer overall ensemble components, but p(y|D) than either approach. training low order simplexes (order 0 to 2) at each mode us- ing ESPRO. Figure8 shows a comparison of test error rate 6.2. NLL, Calibration, and Accuracy under Dataset for ensembles of VGG16 models over different numbers Shift of ensemble components and simplex sizes on CIFAR-10 and CIFAR-100. For any fixed ensemble size, we can gain Modern neural networks are well known to be poorly cali- performance by using a ESPRO ensemble rather than a stan- brated and to result in overconfident predictions. Following dard deep ensemble. Furthermore, training these ESPRO Ovadia et al.(2019), we consider classification accuracy, the negative log likelihood (NLL), and expected calibra- 3We show the relationship between samples from the simplex tion error (ECE), to asses model performance under varying and test error in Appendix B.1. amounts of dataset shift, comparing to deep ensembles (Lak- Loss Surface Simplexes

CIFAR-10 CIFAR-10 CIFAR-100 7.0 7.0 28

6.5 6.5 26

6.0 6.0 24 Test Error (%) Test Error (%) Test Error (%)

0 10000 20000 30000 40000 50000 2 4 6 8 2 4 6 8 Total Train Time Number of Simplexes Number of Simplexes Deep Ensemble 1-Simplex ESPRO 2-Simplex ESPRO

Figure 8. Performance of deep ensembles and ESPRO (with either a 1-simplex, e.g. a line or a 2-simplex, e.g. a triangle) in terms of total train time and the number of simplexes (number of ensembles). Left: Test error as a function of total training budget on CIFAR-10. The number of components in the ensembles increases as curves move left to right. For any given training budget, ESPRO outperforms deep ensembles. Center: Test error as a function of the number of simplexes in the ensemble on CIFAR-10. A comparison of performance of ESPRO models on CIFAR-10 (left) and CIFAR-100 (right) of VGG16 networks with various numbers of ensemble components along the x-axis, and various simplex orders indicated by color. For any fixed number of ensemble components we can outperform a standard deep ensemble using simplexes from ESPRO. Notably, expanding the number of vertices in a simplex takes only 10 epochs of training on CIFAR-10 compared to the 200 epochs of training required to train a model from scratch. On CIFAR-100 adding a vertex to an ESPRO simplex takes just 20 epochs of training compared to 300 to train from scratch.

ESPRO Deep Ensembles that ESPRO with temperature scaling outperforms all other 1.0 1.0 methods for all corruption levels. We show ECE and results 0.5 0.5 Y Y 0.0 0.0 across other types of dataset corruption in Appendix C.1.

0.5 0.5

7.5 5.0 2.5 0.0 2.5 5.0 7.5 7.5 5.0 2.5 0.0 2.5 5.0 7.5 X X 7. Discussion Subspace Inference (y| ) 1.0 We have shown that the loss landscapes for deep neural Simplex Vertices & 0.5 Ensemble Components networks contain large multi-dimensional simplexes of low Y p(f| ) 0.0 p(y| ) loss solutions. We proposed a simple approach, which we 0.5 term SPRO, to discover these simplexes. We show how 7.5 5.0 2.5 0.0 2.5 5.0 7.5 X this geometric discovery could be leveraged to develop a highly practical approach to ensembling, which works by Figure 9. Qualitative uncertainty plots of p(f|D) on a regression sampling diverse and low loss solutions from the simplexes. problem. We show both the 2σ confidence regions from p(f|D) Our approach improves upon state-of-the-art methods in- (the latent noise-free function) and p(y|D), which includes the ob- cluding deep ensembles and MultiSWAG, in accuracy and served noise of the data (aleatoric uncertainty). Top Left: ESPRO, robustness. colored lines are the vertices in the simplex. First two are fixed points in the simplex. Top Right Deep ensembles, colored lines Overall, this paper provides a new understanding of how are individual models. Bottom Left Curve subspaces. ESPRO the loss landscapes in deep learning are structured: rather solutions produce functionally diverse solutions that have good than isolated modes, or basins of attraction connected by in-between (between the data distribution) and extrapolation (out- thin tunnels, there are large multidimensional manifolds of side of the data distribution) uncertainties; the ESPRO predictive connected solutions. distribution is broader and more realistic than deep ensembles and mode-connecting subspace inference, by containing a greater This new understanding of neural network loss landscapes variety of high performing solutions. has many exciting practical implications and future direc- tions. We have shown we can build state-of-the-art ensem- bling approaches from low less simplexes. In the future, shminarayanan et al., 2017), MultiSWA, and MultiSWAG one could build posterior approximations that these (Wilson & Izmailov, 2020), a state-of-the-art approach to simplexes, but also extend coverage to lower density points Bayesian deep learning which generalizes deep ensembles. for a more exhaustive Bayesian model average. We could In Figure 10a, we show results across all levels for the Gaus- also imagine creating models that are significantly sparser sian noise corruption, where we see that ESPRO is most than standard neural networks, with functional variability accurate across all levels. For NLL we use temperature defined by the simplicial complexes we have discovered scaling (Guo et al., 2017) on all methods to reduce the over- here. We could additionally build stochastic MCMC meth- confidence and report the results in Figure 10b. We see ods designed to navigate specifically in these subspaces of Loss Surface Simplexes

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5

0.800 0.680 0.550 0.610 0.890 0.545 0.675 0.795 0.605 0.540 0.885 0.670 0.600 0.790 0.535 0.880 0.665 0.595 0.530 0.785 0.525 Accuracy 0.875 0.660 0.590

0.780 0.520 0.655 0.870 0.585 0.515 0.775 0.650 0.580 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Number of Models Number of Models Number of Models Number of Models Number of Models MultiSWA MultiSWAG Deep Ensembles ESPRO

(a) Accuracy for Gaussian noise corruption Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 4400 16000 7600 12500 19000 15500 4200 12250 18500 7400 12000 15000 18000 4000 7200 11750 17500 7000 14500 3800 11500 17000 NLL

6800 11250 14000 16500 3600 11000 16000 6600 13500 3400 10750 15500 6400 10500 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Number of Models Number of Models Number of Models Number of Models Number of Models MultiSWA MultiSWAG Deep Ensembles ESPRO

(b) NLL for Gaussian noise corruption

Figure 10. (a) Accuracy for Gaussian blur corruption for MultiSWA, MultiSWAG, deep ensembles and ESPRO. (b) NLL under the same corruption. All models were originally significantly over-confident so we use temperature scaling (Guo et al., 2017) to improve uncertainty; after temperature scaling ESPRO generally performs the best under varying levels of corruption. low loss but diverse solutions. These types of topological scape. In International Conference on Machine Learning, features in the loss landscape, which are very distinctive to pp. 1309–1318, 2018. neural networks, hold the keys to understanding generaliza- tion in deep learning. Fort, S. and Jastrzebski, S. Large scale structure of neural network loss landscapes. In Advances in Neural Infor- References mation Processing Systems, volume 32, pp. 6709–6717, 2019. Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The loss surfaces of multilayer net- Freeman, C. D. and Bruna, J. Topology and geometry works. In Artificial intelligence and , pp. 192– of half-rectified network optimization. In International 204. PMLR, 2015. Conference on Learning Representations, 2017. URL arXiv:1611.01540. Colins, K. D. Cayley-menger . From MathWorld–A Wolfram Web Resource, created by Eric Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., W. Weisstein, https://mathworld.wolfram.com/Cayley- and Wilson, A. G. Loss surfaces, mode connectivity, and MengerDeterminant.html. fast ensembling of dnns. Advances in Neural Information Processing Systems, 31:8789–8798, 2018. Czarnecki, W. M., Osindero, S., Pascanu, R., and Jader- berg, M. A deep neural network’s loss surface con- Goodfellow, I. J., Vinyals, O., and Saxe, A. M. Qualita- tains every low-dimensional pattern. arXiv preprint tively characterizing neural network optimization prob- arXiv:1912.07559, 2019. lems. arXiv preprint arXiv:1412.6544, 2014. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On Heigold, G., Gelly, S., et al. An image is worth 16x16 calibration of modern neural networks. In International words: Transformers for image recognition at scale. arXiv Conference on Machine Learning, volume 70, pp. 1321– preprint arXiv:2010.11929, 2020. 1330. PMLR, 2017. Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, Hochreiter, S. and Schmidhuber, J. Flat minima. Neural F. Essentially no barriers in neural network energy land- Computation, 9(1):1–42, 1997. Loss Surface Simplexes

Huang, W. R., Emam, Z., Goldblum, M., Fowl, L., Terry, J. K., Huang, F., and Goldstein, T. Understanding generalization through visualizations. arXiv preprint arXiv:1906.03291, 2019. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. Averaging weights leads to wider optima and better generalization. In Uncertainty in Artificial Intelligence, 2018. URL arXiv:1803.05407. Izmailov, P., Maddox, W. J., Kirichenko, P., Garipov, T., Vetrov, D., and Wilson, A. G. Subspace inference for bayesian deep learning. In Uncertainty in Artificial Intel- ligence, pp. 1169–1179. PMLR, 2019. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In Inter- national Conference on Learning Representations, 2017. URL arXiv:1609.04836. Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30:6402–6413, 2017. Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In Advances in neural information processing systems, pp. 6389–6399, 2018. Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for bayesian uncer- tainty in deep learning. Advances in Neural Information Processing Systems, 32:13153–13164, 2019. Maddox, W. J., Benton, G., and Wilson, A. G. Rethinking parameter counting in deep models: Effective dimension- ality revisited. arXiv preprint arXiv:2003.02139, 2020. Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., and Snoek, J. Can you trust your model’s uncertainty? evaluat- ing predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, volume 32, pp. 13991–14002, 2019. Simonyan, K. and Zisserman, A. Very deep convolu- tional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. Skorokhodov, I. and Burtsev, M. Loss landscape sight- seeing with multi-point optimization. arXiv preprint arXiv:1910.03867, 2019. Wilson, A. G. and Izmailov, P. Bayesian deep learning and a probabilistic perspective of generalization. In Advances in Neural Information Processing Systems, volume 33, 2020. Appendix for Loss Surface Simplexes for Mode Connecting Volumes and Fast Ensembling

Outline The Appendix is outlined as follows: In Appendix A.1, we give a more detailed description of our methods, focusing first on computing the simplex volume and sampling from the simplexes, then describe vertex initialization and regularization, giving training details, and finally describing the training procedure for multi-dimensional mode connectors. In Appendix A.2, we describe several more results on volume and ensembling, particularly on the number of samples required for good performance with SPRO and ESPRO. Finally, in AppendixC, we plot the results of a larger suite of corruptions on CIFAR- 10 for ESPRO, deep ensembles, and MultiSWAG.

W0 W1 W2 W3 W0 W1 W2 W3 W0 W1 W2 W3

Figure A.1. A simplified version of the progressive understanding of the loss landscape of neural networks. Left: The traditional view in which low loss modes are disconnected in parameter space. Center: The updated understanding provided by works such as Draxler et al. (2018), Fort & Jastrzebski(2019), and Garipov et al.(2018), in which modes are connected along thin paths or tunnels. Right: The view we present in this work: independently trained models converge to points on the same volume of low loss.

A. Extended Methodology First, we present a two dimensional version of the schematic in Figure1 in A.1, which explains the same progressive illustration, but in two dimensions.

A.1. Simplex Volume and Sampling We employ simplexes in the loss surface for two reasons primarily:

• sampling uniformly from within a simplex is straightforward, meaning we can estimate the expected loss within any found simplexes easily,

• computing the Volume of a simplex is efficient, allowing for regularization encouraging high-Volume simplexes. Loss Surface Simplexes

1.0 1.0 0.8 0.8 X3 0.6 X3 0.6 0.4 0.4 0.2 0.0 0.2 0 0.2 20 0.0 0.4 0.0 40 0.6 60 0.0 0.2 0.8 X1 0.0 0.2 80 X1 0.4 0.6 1.0 0.4 0.6 100 X2 0.8 1.0 X2 0.8 1.0

Figure A.2. Left: 100 samples drawn uniformly from within the unit simplex. Right: 100 samples drawn from a non-unit simplex (note the scale of the X1 axis). The distribution of points in both simplexes is visually indistinguishable — evidence that the method for sampling from a unit simplex is sufficient to draw samples from arbitrary simplexes.

Sampling from Simplexes: Sampling from the standard simplex is just a specific case of sampling from a with concentration parameters all equal to 1. The standard n-simplex is a simplex is a simplex formed by the vectors v0,..., vn such that the vi’s are the standard unit vectors. Therefore, to draw samples from a standard n-simplex in a d dimensional space with vertices v0,..., vn, we follow the same procedure to sample from a Dirichlet distribution.

T i.i.d. yi Pn To sample vector x = [x0, . . . , xd] we first draw y0, . . . , yn ∼ Exp(1), then set y˜i = Pd . Finally, x = i=1 y˜ivi. j=1 yj While this method is sufficient for simulating vectors uniformly at random from the standard simplex, there is no guarantee that such a sampling method produces uniform samples from an arbitrary simplex, and thus samples of the loss over the simplex that we use in Equation1 may not be an unbiased estimate of the expected loss over the simplex. Practically, we do not find this to be an issue, and are still able to recover low loss simplexes with this approach. Furthermore, Figure A.2 shows that the distribution of samples in a unit simplex is visually similar to the samples from an elongated simplex where we multiply one of the basis vectors by a factor of 100. This figure serves to show that although there may be some bias in our estimate of the loss over the simplex in Equation1, it should not be (and is not in practice) limiting to our optimization routine. Note too, this may appear like a simplistic case, but typically the simplexes found by SPRO contain only a small number of vertices, so a 2-simplex whose lengths vary by a factor of nearly 100 is a reasonable comparison to a scenario we may find in practice.

Computing Simplex Volume: Simplex Volumes can be easily computed using Cayley-Menger determinants (Colins). If we have an n-simplex defined by the parameter vectors w0, . . . , wn the Cayley-Menger determinant is defined as

2 2 0 d01 ··· d0n 1 2 2 d 0 ··· d n 1 01 0 ...... CM(w0, . . . , wn) = ...... (A.1)

d2 d2 ··· 0 1 n0 n1 1 1 1 1 1

The Volume of the simplex S(w0,...,wn) is then given as

n+1 2 (−1) V(S ) = CM(w , . . . , wn). (A.2) w0,...,wn) (n!)22n 0 While in general we may be adverse to computing determinants or factorial terms the simplexes we work with in this paper Loss Surface Simplexes

7.0

6.8

Simplex Vertices 6.6 2 3 4 Test Error 6.4 5

6.2

6.0 10 9 10 7 10 5 10 3 10 1 101 Regularization Parameter, *

Figure A.3. CIFAR-10 test accuracy as a function of regularization parameter λ∗ and colored by the number of vertices. Accuracy is essentially unchanged for the various regularization parameters. are generally low order (all are under 10 vertices total) meaning that computing the Cayley-Menger determinants is generally a quite fast and stable computation.

A.2. Initialization and Regularization Vertex Initialization: We initialize the jth parameter vector corresponding to a vertex in the simplex as the mean of the 1 Pj−1 previously found vertices, wj = j i=0 wi and train using the regularized loss in Eq.1.

Regularization Parameter: As the order of the simplex increases, the Volume of the simplex increases exponentially. Thus, we define a distinct regularization parameter, λj, in training each θj to provide consistent regularization for all vertices. ∗ To choose the λk’s we define a λ and compute λ∗ λ = , (A.3) k log V(K) where K is randomly initialized simplicial complex of the same structure that the simplicial complex will have while training θj. Eq. A.3 normalizes the λk’s such that they are similar when accounting for the exponential growth in volume as the order of the simplex grows. In practice we need only small amounts of regularization, and choose λ∗ = 10−8. Since we are spanning a space of near constant loss any level of regularization will encourage finding simplexes with non-trivial Volume. Finally, when dealing with models that use batch normalization, we follow the procedure of Garipov et al.(2018) and compute several forwards passes on the training data for a given sample from the simplex to update the batch normalization statistics. For layer normalization, we do not need to use this procedure as layer norm updates at test time.

A.3. Training Details Throughout, we used VGG-16 like networks originally introduced in Simonyan & Zisserman(2014) from https: //github.com/pytorch/vision/blob/master/torchvision/models/vgg.py. For training, we used standard normalization, random horizontal flips, and crops and a batch size of 128. We used SGD with momentum = 0.9, and a cosine annealing learning rate with a single cycle, a learning rate of 0.05, and weight decay 5e − 4 training for 300 epochs for the pre-trained VGG models. For SPRO, we used a learning rate of 0.01 and trained for 20 epochs for each connector. In our experiments with transformers, we used the ViT-B 16 image transformer model (Dosovitskiy et al., 2020) pre-trained on ImageNet from https://github.com/jeonsworld/ViT-pytorch and trained on CIFAR100 with upsampled image size of 224 with a batch size of 512 for 50000 steps (the default fine-tuning on CIFAR-100). Again, we used random flips and crops for data augmentation. Loss Surface Simplexes

Test Error vs Samples from Simplex

0.286

0.284

0.282 Test Error

0.280

0 20 40 60 80 100 Number of Samples

Figure A.4. Test error vs. number of samples, J, in the ensemble on CIFAR-100 using a VGG16 network and a 3-simplex trained with SPRO. For any number of components in the SPRO ensemble greater than approximately 25 we achieve near constant test error.

To train these SPRO models, we used a learning rate of 0.001 and trained with SGD for 30 epochs for each connector, using 20 samples from the simplex at test time.

A.4. Multi-Dimensional Mode Connectors To train the multi-dimensional SWAG connectors, we connected two pre-trained networks following Garipov et al.(2018) using a piece-wise linear curve, trained for 75 epochs with an initial learning rate of 0.01, decaying the learning rate to 1e − 4 by epoch 40. At epoch 40, we reset the learning rate to be constant at 5e − 3. The final individual sample accuracy (not SWA) was 91.76%, which is similar to the final individual sample accuracies for standard training of VGG networks with SWAG. We used random crops and flips for data augmentation.

B. Extended Volume and Ensembling Results B.1. Test Error vs. Simplex Samples SPRO gives us access to a whole space of model parameters to sample from rather than just a discrete number of models to use as in deep ensembles. Therefore a natural question to ask is how many models and forwards passes need to be sampled from the simplex to achieve the highest accuracy possible without incurring too high of a cost at test time. Figure A.4 shows that for a VGG16 network trained on CIFAR-100 we achieve near constant accuracy for any number of ensemble components greater than approximately 25. Therefore, for the ensembling experiments in Section 5.4 we use 25 samples from each simplex to generate the SPRO ensembles. In this work we are not focused on the issue of test time compute cost, and if that were a consideration for deployment of a SPRO model we could evaluate the trade-off in terms of test time compute vs accuracy, or employ more sophisticated methods such as ensemble distillation.

B.2. Loss Surfaces of Transformers Next, we show the results of training a SPRO 3−simplex with an image transformer on CIFAR-100 (Dosovitskiy et al., 2020) in Figure A.5. Due to computational requirements, the transformer was pre-trained on ImageNet before being finetuned on CIFAR-100 for 50, 000 minibatches. We then trained each vertex for an additional 10 epochs. Due to the inflexibility of the architecture, the volume of the simplex found is much smaller, approximately 10−21. This resulted in some instability through training, possibly preventing much benefits in ensembling transformer models, as shown in Figure A.8. However, these results do demonstrate that a significant region of low loss can be found in subspaces of transformer models, and further work will be necessary to efficiently exploit these regions of low loss, much like has been done with conv nets and Loss Surface Simplexes

1.04

0.96

0.88 Cross Entropy Loss

0.80

1 2 2 2 0.72

w0 0 w0 0 w0 1 0 1 0.64 0.56

0.48

0.40

0.32

Figure A.5. Loss surface visualizations of the faces of a sample ESPRO 3-simplex for a Transformer architecture (Dosovitskiy et al., 2020) fine-tuned on CIFAR-100. Here, the volume is considerably smaller, but a low loss region is found.

CIFAR-10 Test Error CIFAR-100 Test Error 0.30

0.08 Connectors 0.28 1 0.07 2 0.26 3 Test Error Test Error

0.06 0.24

2 3 4 5 2 3 4 5 Modes Modes

Figure A.6. Test error for mode connecting simplexes that connect various numbers of modes through various numbers of connecting points in the parameter space of VGG16 networks trained on CIFAR-10 and CIFAR-100. The error rates of baseline models are shown as horizontal dotted lines. In general the highest performing models are those with the fewest modes and the fewest connecting points, but the performance gaps between configurations are small.

ResNets.

B.3. Ensembling Mode Connecting Simplexes We can average predictions over these mode connecting volumes, generating predictions as ensembles, yˆ = 1 P f(x, φ ), where φ ∼ K indicates we are sampling models uniformly at random from the simplicial com- H φh∼K h h plex K(S(w0,θ0,... ),S(w1,θ0,... ),... ). Test error for such ensembles for volumes in the parameter space of VGG16 style networks on both CIFAR-10 and CIFAR-100 are given in Figure A.6. We see that while some improvements over baselines can be made, mode connecting simplexes do not lead to highly performant ensembles.

B.4. Ensembling Modes of SPRO Figure A.7 presents the results of Figure8 in the main text, but against the total training budget rather than the number of ensemble components. We see from the plot that on either dataset, for nearly any fixed training budget the most accurate model is an ESPRO model, even if that means using our budget to train ESPRO simplexes but fewer models overall. Times correspond to training models sequentially on an NVIDIA Titan RTX with 24GB of memory. Finally, Figure A.8 presents the results of ensembling with SPRO using state of the art transformers architectures on CIFAR-100 (Dosovitskiy et al., 2020). We find, counterintuitively that there is only a very small performance difference from ensembling with SPRO compared to the base architecture. We suspect that this is because it is currently quite difficult Loss Surface Simplexes

CIFAR-10 CIFAR-100 7.0 28

Number of Vertices 6.5 26 1 2 3 6.0 24 Test Error (%) Test Error (%)

10000 20000 30000 40000 50000 60000 20000 40000 60000 80000 100000 Total Training Time Total Training Time

Figure A.7. Test error of ESPRO models on CIFAR-10 (left) and CIFAR-100 (right) as a function of total training time (training the original models and the ESPRO simplexes). The color of the curves indicate the number of the vertices in the simplex, and the points corresponding to increasing numbers of ensemble components moving left to right (ranging from 1 to 8). We see that on either dataset for nearly any fixed training budget, we are better off training fewer models overall and using ESPRO to construct simplexes to sample from.

0.305 Deep Ensembles 0.074 Simplex Size 2 0.300 Simplex Size 3 0.295 0.073 Simplex Size 4 Simplex Size 5 0.290

0.072 0.285 Test NLL 0.280 Test Error (%) 0.071 0.275

1 2 3 4 1 2 3 4 Number of Simplexes Number of Simplexes

Figure A.8. Test Error and NLL for the number of components in SPRO ensembles using image transformers on CIFAR-100. Comparing to deep ensembles, performance is quite similar, while ESPRO with four dimensional simplexes performs very slightly better. However, NLL is slightly worse. to train transformers without using significant amounts of unlabelled data.

C. Extended Uncertainty Results C.1. Further NLL and Calibration Results Finally, we include the results across 18 different corruptions for the ensemble components. In order, these are jpeg, fog, snow, brightness, pixelate, zoom blur, saturate, contrast, motion blur, defocus blur, speckle noise, gaussian blur, glass blur, shot noise, frost, spatter, impulse noise and, elastic transform. Loss Surface Simplexes

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.885 0.910 0.860 0.870 0.835 0.880 0.855 0.905 0.865 0.830 0.875 0.850 0.825 0.900 0.860 0.845

Accuracy 0.870 0.820 0.855 0.840 0.895 0.865 0.815 0.835 0.850 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

3600 6000 4400 4800 5200 3400 5800 4200 4600 5000 5600 3200 NLL 4400 4800 4000 5400 3000 4200 4600 3800 5200 4400 2800 4000 5000 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.045 0.05 0.05 0.06 0.030 0.040 0.04 0.025 0.035 0.04 0.05

ECE 0.030 0.020 0.03 0.03 0.04 0.025 0.015 0.020 0.02 0.02 0.03 0.010 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.9. Accuracy, NLL and ECE with increasing intensity of the jpeg compression corruption (from left to right).

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5

0.88 0.76 0.940 0.930 0.910 0.87 0.74 0.925 0.905 0.935 0.86 0.900 0.920 0.72 Accuracy 0.930 0.85 0.895 0.915 0.70 0.84 0.925 0.890 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

3600 4750 9000 2400 2800 3400 4500 8500 2200 2600 3200 4250 NLL 2400 8000 2000 3000 4000

2200 2800 3750 7500 1800 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.025 0.030 0.10 0.04 0.020 0.020 0.025 0.08 0.015 0.03 0.015 0.020 ECE

0.010 0.015 0.02 0.06 0.010

0.005 0.010 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.10. Accuracy, NLL and ECE with increasing intensity of the fog corruption (from left to right). Loss Surface Simplexes

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.910 0.840 0.86 0.82 0.905 0.835 0.85 0.83 0.830 0.81 0.900 0.825 0.84 0.80

Accuracy 0.895 0.82 0.820

0.890 0.815 0.83 0.79 0.81 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

3800 5750 6250 3600 5500 6000 7000 6000 5250 3400 5750 NLL 6500 5000 5500 3200 5500 4750 5250 6000 3000 5000 4500 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.08 0.040 0.07 0.07 0.06 0.035 0.07 0.06 0.06 0.030 0.05 0.06 ECE 0.05 0.025 0.05 0.04 0.05 0.020 0.04 0.04 0.03 0.015 0.04 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.11. Accuracy, NLL and ECE with increasing intensity of the snow corruption (from left to right).

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.945 0.915 0.940 0.935 0.940 0.910 0.940 0.935 0.930 0.905 0.935 0.925 0.935 0.930 0.900

Accuracy 0.930 0.920 0.895 0.925 0.930 0.915 0.890 0.925 0.920 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

3000 2400 3750 2600 2400 2800 3500 2200 2400 2200 2600 3250 NLL 2000 2200 2400 2000 3000 2000 2200 1800 2750 1800 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.030 0.025 0.025 0.030 0.025 0.04 0.020 0.020 0.025 0.020

ECE 0.03 0.015 0.015 0.020 0.015 0.015 0.010 0.010 0.02 0.010 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.12. Accuracy, NLL and ECE with increasing intensity of the brightness corruption (from left to right). Loss Surface Simplexes

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.935 0.83 0.900 0.82 0.64 0.930 0.895 0.91 0.81 0.890 0.62 0.925 0.80 0.90 0.885 Accuracy 0.79 0.60 0.920 0.880 0.78 0.89 0.58 0.875 0.77 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

15000 2800 3600 4200 7500 14500 4000 3400 2600 7000 3800 14000 3200 NLL 2400 3600 13500 3000 6500 3400 13000 2200 2800 6000 3200 12500 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.20 0.025 0.035 0.08 0.03 0.020 0.030 0.07 0.18

ECE 0.015 0.025 0.06 0.02 0.16

0.010 0.020 0.05 0.01 0.14 0.015 0.04 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.13. Accuracy, NLL and ECE with increasing intensity of the pixelate corruption (from left to right).

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.76 0.82 0.90 0.85 0.88 0.81 0.74 0.84 0.89 0.80 0.87 0.83 0.79 0.72 Accuracy 0.88 0.86 0.82 0.78 0.70 0.81 0.77 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

4250 7500 4750 6000 4000 9500 4500 7000 3750 4250 5500 9000

NLL 6500 3500 4000 5000 8500 3250 3750 6000 8000 3000 3500 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.08 0.04 0.06 0.030 0.11 0.05 0.07 0.025 0.10 0.03 0.06 0.020 0.04 0.09 ECE

0.015 0.02 0.05 0.08 0.03

0.010 0.04 0.07 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.14. Accuracy, NLL and ECE with increasing intensity of the zoom blur corruption (from left to right). Loss Surface Simplexes

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5

0.940 0.89 0.910 0.920 0.925 0.915 0.905 0.935 0.88 0.920 0.910 0.900 0.930 0.87 0.905 Accuracy 0.915 0.895 0.925 0.900 0.86 0.910 0.890 0.895 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

3000 2600 3600 3500 5000

2800 3400 2400 3250 4500 3000 NLL 2600 3200 2200 4000 3000 2750 2400 2000 2800 2500 3500 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.030 0.035 0.025 0.035 0.05 0.025 0.030 0.030 0.020 0.04 0.020 0.025 0.025 ECE 0.015 0.020 0.015 0.020 0.03

0.015 0.015 0.010 0.010 0.02 0.010 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.15. Accuracy, NLL and ECE with increasing intensity of the saturate corruption (from left to right).

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.90 0.835 0.23 0.67 0.830 0.22 0.935 0.89 0.825 0.66 0.21

0.930 0.820 0.20 0.65

Accuracy 0.88 0.815 0.19 0.925 0.64 0.810 0.18 0.87 0.805 0.63 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

11000 2600 4000 5800 10750 27000 3800 2400 5600 10500 3600

NLL 26000 2200 10250 5400 3400 10000 2000 5200 25000 3200 9750 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.035 0.05 0.030 0.42 0.020 0.12 0.40 0.025 0.04 0.015 0.10 0.38 ECE 0.020 0.03 0.36 0.015 0.010 0.08 0.34 0.010 0.02 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.16. Accuracy, NLL and ECE with increasing intensity of the contrast corruption (from left to right). Loss Surface Simplexes

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.81 0.81 0.92 0.87 0.80 0.80 0.72 0.86 0.79 0.79 0.91 0.78 0.78 0.70 0.85 Accuracy 0.77 0.77 0.90 0.84 0.76 0.76 0.68

2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

3400 7500 7500 10000 3200 5000 7000 9500 3000 7000 NLL 4500 2800 6500 9000 6500

2600 8500 4000 6000 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.030 0.12 0.08 0.05 0.08 0.11 0.025 0.07 0.07 0.04 0.10 0.020 ECE 0.06 0.06 0.09 0.015 0.03 0.05 0.05 0.08

0.010 0.02 0.04 0.07 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.17. Accuracy, NLL and ECE with increasing intensity of the motion blur corruption (from left to right).

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.91 0.935 0.66 0.940 0.84 0.90 0.930 0.64 0.935 0.925 0.82 0.89 0.62 Accuracy 0.920 0.930 0.88 0.80 0.60 0.915

2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

4000 6500 2400 2800 13000 3750 2600 6000 2200 12500 3500

NLL 2400 5500 12000 2000 3250 2200 3000 5000 11500 1800 2000 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.18 0.025 0.025 0.07 0.035 0.17 0.020 0.020 0.030 0.06 0.16 0.025 ECE 0.015 0.015 0.05 0.15 0.020 0.14 0.010 0.010 0.015 0.04 0.13 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.18. Accuracy, NLL and ECE with increasing intensity of the defocus blur corruption (from left to right). Loss Surface Simplexes

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.71 0.915 0.845 0.800 0.60 0.910 0.840 0.795 0.70 0.59 0.835 0.905 0.790 0.69 0.830 0.58 Accuracy 0.900 0.785 0.825 0.57 0.895 0.68 0.780 0.820 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

3600 6000 11500 7500 16000 3400 5750 7250 11000 3200 15000 5500 7000 NLL 10500 3000 5250 6750 14000 2800 5000 6500 10000

2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.035 0.10 0.16 0.24 0.07 0.030 0.09 0.22 0.025 0.06 0.08 0.14

ECE 0.20 0.020 0.05 0.07 0.12 0.18 0.015 0.06 0.04 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.19. Accuracy, NLL and ECE with increasing intensity of the speckle noise corruption (from left to right).

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.86 0.78 0.56 0.940 0.90 0.76 0.84 0.54

0.935 0.74 0.89 0.82 0.52 Accuracy 0.72 0.930 0.50 0.88 0.80 0.70 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

9500 2400 4000 17000 6000 9000 3750 2200 16500 3500 5500 8500

NLL 16000 2000 3250 8000 5000 15500 3000 1800 7500 15000 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.12 0.025 0.035 0.11 0.06 0.24 0.020 0.030 0.10 0.025 0.05 0.22 ECE 0.015 0.09 0.020 0.04 0.010 0.015 0.08 0.20

2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.20. Accuracy, NLL and ECE with increasing intensity of the Gaussian blur corruption (from left to right). Loss Surface Simplexes

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.48 0.53 0.69 0.61 0.63 0.47 0.68 0.52 0.60 0.62 0.67 0.46 Accuracy 0.59 0.61 0.51 0.66 0.45 0.60 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

15500 14500 12000 21000 15000 18000 14000 11500 14500 13500 20000 11000 14000 17000 NLL 13000 13500 10500 19000 12500 16000 13000 10000 12000 18000 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.16 0.20 0.22 0.32 0.26

0.18 0.14 0.20 0.30 0.24 ECE

0.16 0.12 0.28 0.18 0.22

0.26 0.14 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.21. Accuracy, NLL and ECE with increasing intensity of the glass blur corruption (from left to right).

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.71 0.755 0.60 0.915 0.875 0.70 0.750 0.910 0.59 0.870 0.69 0.745 0.58 0.905 0.68

Accuracy 0.865 0.740 0.57 0.900 0.67 0.860 0.735 0.56 0.895 0.66 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

4800 3400 11500 16000 4600 9000 3200 4400 11000 15000 NLL 3000 4200 8500 10500 2800 4000 14000 8000 10000 2600 3800 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.26 0.035 0.13 0.05 0.030 0.12 0.16 0.24

0.025 0.04 0.11 0.22 0.14 ECE 0.10 0.020 0.20 0.03 0.09 0.015 0.12 0.18 0.08 0.010 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.22. Accuracy, NLL and ECE with increasing intensity of the shot noise corruption (from left to right). Loss Surface Simplexes

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.82 0.73 0.925 0.890 0.72 0.82 0.81 0.920 0.885 0.71 0.915 0.880 0.81

Accuracy 0.80 0.70 0.910 0.875 0.905 0.80 0.79 0.69 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

3200 4400 10500 6750 7000 3000 4200 6500 6750 10000 2800 4000 6250 6500 NLL 3800 2600 6000 9500 6250 3600 5750 2400 6000 3400 5500 9000 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.12 0.030 0.040 0.07 0.06 0.11 0.025 0.035 0.06 0.030 0.05 0.10

ECE 0.020 0.05 0.025 0.09 0.015 0.04 0.020 0.04 0.08

0.010 0.015 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.23. Accuracy, NLL and ECE with increasing intensity of the frost corruption (from left to right).

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.890 0.855 0.85 0.925 0.77 0.885 0.850

0.920 0.880 0.845 0.84 0.76 0.915 0.875 0.840 Accuracy 0.83 0.910 0.870 0.835 0.75

0.865 0.830 0.905 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

6500 3200 9500 4500 5500 3000 9000 4250 5250 6000 2800

NLL 4000 5000 8500 5500 2600 3750 4750 8000 2400 5000 3500 4500 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.035 0.06 0.08 0.12 0.030 0.04 0.05 0.07 0.11 0.025 0.03 0.04 0.06 0.10 ECE 0.020 0.09 0.015 0.03 0.05 0.02 0.08 0.010 0.02 0.04 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.24. Accuracy, NLL and ECE with increasing intensity of the spatter corruption (from left to right). Loss Surface Simplexes

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.72 0.80 0.56 0.42

0.87 0.71 0.79 0.55 0.40

0.54 0.86 0.78 0.70 0.38 Accuracy

0.53 0.77 0.36 0.85 0.69

2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

8000 17000 10500 5000 16500 24000 7500 4750 10000 16000 22000 NLL 4500 15500 7000 9500 4250 15000 14500 20000 4000 6500 9000 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.09 0.24 0.40 0.05 0.08 0.12 0.22 0.07 0.35 0.04 0.20

ECE 0.10 0.06 0.30 0.03 0.18 0.05 0.08 0.16 0.02 0.04 0.25 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.25. Accuracy, NLL and ECE with increasing intensity of the impulse noise corruption (from left to right).

Corruption Level: 1 Corruption Level: 2 Corruption Level: 3 Corruption Level: 4 Corruption Level: 5 0.915 0.82 0.89 0.85 0.910 0.91 0.81 0.905 0.88 0.84 0.80 0.900 0.90 0.83

Accuracy 0.87 0.895 0.79 0.82 0.890 0.89 0.86 0.78 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

3800 3800 4500 3600 6000 3600 7000 4250 3400 3400 5500 3200 4000 6500 NLL 3200

3000 3750 3000 5000 3500 6000 2800 2800

2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8

0.030 0.030 0.06 0.07 0.04 0.025 0.025 0.05 0.06 0.020 0.020 0.03 0.05

ECE 0.04 0.015 0.015 0.04 0.02 0.03 0.010 0.010 0.03

2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 2 4 6 8 Number of Models Number of Models Number of Models Number of Models Number of Models

MultiSWA MultiSWAG Deep Ensembles ESPRO

Figure A.26. Accuracy, NLL and ECE with increasing intensity of the elastic transform corruption (from left to right).