Loss Surface Simplexes for Mode Connecting Volumes and Fast Ensembling

Gregory W. Benton¹, Wesley J. Maddox¹, Sanae Lotfi¹, Andrew Gordon Wilson¹

Abstract

With a better understanding of the loss surfaces for multilayer networks, we can build more robust and accurate training procedures. Recently it was discovered that independently trained SGD solutions can be connected along one-dimensional paths of near-constant training loss. In this paper, we show that there are mode-connecting simplicial complexes that form multi-dimensional manifolds of low loss, connecting many independently trained models. Inspired by this discovery, we show how to efficiently build simplicial complexes for fast ensembling, outperforming independently trained deep ensembles in accuracy, calibration, and robustness to dataset shift. Notably, our approach only requires a few training epochs to discover a low-loss simplex, starting from a pre-trained solution. Code is available at https://github.com/g-benton/loss-surface-simplexes.

Figure 1. A progressive understanding of the loss surfaces of neural networks. Left: The traditional view of loss in parameter space, in which regions of low loss are disconnected (Goodfellow et al., 2014; Choromanska et al., 2015). Center: The revised view of loss surfaces provided by work on mode connectivity; multiple SGD training solutions are connected by narrow tunnels of low loss (Garipov et al., 2018; Draxler et al., 2018; Fort & Jastrzebski, 2019). Right: The viewpoint introduced in this work; SGD training converges to different points on a connected volume of low loss. We show that paths between different training solutions exist within a large multi-dimensional manifold of low loss. We provide a two dimensional representation of these loss surfaces in Figure A.1.

arXiv:2102.13042v1 [cs.LG] 25 Feb 2021

1. Introduction

Despite significant progress in the last few years, little is known about neural network loss landscapes. Recent works have shown that the modes found through SGD training of randomly initialized models are connected along narrow pathways connecting pairs of modes, or through tunnels connecting multiple modes at once (Garipov et al., 2018; Draxler et al., 2018; Fort & Jastrzebski, 2019). In this paper we show that there are in fact large multi-dimensional simplicial complexes of low loss in the parameter space of neural networks that contain arbitrarily many independently trained modes.

The ability to find these large volumes of low loss that can connect any number of independent training solutions represents a natural progression in how we understand the loss landscapes of neural networks, as shown in Figure 1. In the left of Figure 1, we see the classical view of loss surface structure in neural networks, where there are many isolated low loss modes that can be found through training randomly initialized networks. In the center we have a more contemporary view, showing that there are paths that connect these modes. On the right we present a new view — that all modes found through standard training converge to points within a single connected multi-dimensional volume of low loss.

We introduce Simplicial Pointwise Random Optimization (SPRO) as a method of finding simplexes and simplicial complexes that bound volumes of low loss in parameter space. With SPRO we are able to find mode connecting spaces that simultaneously connect many independently trained models through a single well defined multi-dimensional manifold. Furthermore, SPRO is able to explicitly define a space of low loss solutions through determining the bounding vertices of the simplicial complex, meaning that computing the dimensionality and volume of the space becomes straightforward, as does sampling models within the complex.

This enhanced understanding of loss surface structure enables practical methodological advances. Through the ability to rapidly sample models from within the simplex we can form Ensembled SPRO (ESPRO) models. ESPRO works by generating a simplicial complex around independently trained models and ensembling from within the simplexes, outperforming the gold standard deep ensemble combination of independently trained models (Lakshminarayanan et al., 2017). We can view this ensemble as an approximation to a Bayesian model average, where the posterior is uniformly distributed over a simplicial complex.

Our paper is structured as follows: In Section 3, we introduce a method to discover multi-dimensional mode connecting simplexes in the neural network loss surface. In Section 4, we show the existence of mode connecting volumes and provide a lower bound on the dimensionality of these volumes. Building on these insights, in Section 5 we introduce ESPRO, a state-of-the-art approach to ensembling with neural networks, which efficiently averages over simplexes. In Section 6, we show that ESPRO also provides well-calibrated representations of uncertainty. We emphasize that ESPRO can be used as a simple drop-in replacement for deep ensembles, with improvements in accuracy and uncertainty representations. Code is available at https://github.com/g-benton/loss-surface-simplexes.

¹New York University. Correspondence to: Gregory W. Benton.

2. Related Work

The study of neural network loss surfaces has long been intertwined with an understanding of neural network generalization. Hochreiter & Schmidhuber (1997) argued that flat minima provide better generalization, and proposed an optimization algorithm to find such solutions. Keskar et al. (2017) and Li et al. (2018) reinvigorated this argument by visualizing loss surfaces and studying the geometric properties of deep neural networks at their minima. Izmailov et al. (2018) found that averaging SGD iterates with a modified learning rate finds flatter solutions that generalize better. Maddox et al. (2019) leveraged these insights in the context of Bayesian deep learning to form posteriors in flat regions of the loss landscape. Moreover, Maddox et al. (2020) found many directions in parameter space that can be perturbed without changing the training or test loss.

Freeman & Bruna (2017) demonstrated that single layer ReLU neural networks can be connected along a low loss curve. Draxler et al. (2018) and Garipov et al. (2018) simultaneously demonstrated that it is possible to find low loss curves for ResNets and other deep networks. Skorokhodov & Burtsev (2019) used multi-point optimization to parameterize wider varieties of shapes in loss surfaces, including exotic shapes such as cows. Czarnecki et al. (2019) then proved theoretically that low dimensional spaces of nearly constant loss do exist in the loss surfaces of deep ReLU networks, but did not provide an algorithm to find these loss surfaces.

Most closely related to our work, Fort & Jastrzebski (2019) propose viewing the loss landscape as a series of potentially connected low-dimensional wedges in the much higher dimensional parameter space. They then demonstrate that sets of optima can be connected via low-loss connectors that are generalizations of the procedure of Garipov et al. (2018). Our work generalizes these findings by discovering higher dimensional mode connecting volumes, which we then leverage for a highly efficient and practical ensembling procedure.

3. Mode Connecting Volumes

We now show how to generalize the procedure of Garipov et al. (2018) to discover simplexes of mode connecting volumes, containing infinitely many mode connecting curves. In Section 4, we then show how to use our procedure to demonstrate the existence of these volumes in modern neural networks, revising our understanding of the structure of their loss landscapes. In Sections 5 and 6 we show how we can use these discoveries to build practical new methods which provide state of the art performance for both accuracy and uncertainty representation. We refer to our approach as SPRO (Simplicial Pointwise Random Optimization).

3.1. Simplicial Complexes of Low Loss

To find mode connecting volumes we seek simplexes and simplicial complexes of low loss. Two primary reasons we seek simplexes of low loss are that (i) simplexes are defined by only a few points, and (ii) simplexes are easily sampled. The first means that to define a mode connecting simplicial complex of low loss we need only find a small number of vertices to fully determine the simplexes in the complex. The second point means that we have easy access to the models contained within the simplex, leading to the practical simplex-based ensembling methods presented later in the paper.

We consider data D and training objective L. We refer to S(a0,a1,...,ak) as the k-simplex formed by vertices a0, a1, . . . , ak, and V(S(a0,...,ak)) as the volume of the simplex.¹ Simplicial complexes are denoted K(S(a0,a1,...,aNa), S(b0,b1,...,bNb), . . . , S(m0,m1,...,mNm)), and their volume is computed as the sum of the volumes of their components. We use wj to denote modes, or SGD training solutions, and θj to denote mode connecting points. For example, we could train two independent models to find parameter settings w0 and w1, and then find a mode connecting point θ0 such that the path w0 → θ0 → w1 traverses low loss parameter settings, as in Fort & Jastrzebski (2019) and Garipov et al. (2018).

¹We use Cayley-Menger determinants to compute the volume of simplexes; for more information see Appendix A.1.

3.2. Simplicial Complexes With SPRO

To find a simplicial complex of low loss solutions, we first find a collection of modes w0, . . . , wk through standard training. This procedure gives the trivial simplicial complex K(S(w0), . . . , S(wk)) (or K), a complex containing k disjoint 0-simplexes. With these modes we can then iteratively add connecting points, θj, to join any number of the 0-simplexes in the complex, and train the parameters in θj such that the loss within the simplicial complex, K, remains low. The procedure to train these connecting θj forms the core of the SPRO algorithm, given here.

To gain intuition, we first consider some examples before presenting the full SPRO training procedure. As we have discussed, we can take modes w0 and w1 and train θ0 to find a complex K(S(w0,θ0), S(w1,θ0)), which recovers a mode connecting path as in Garipov et al. (2018). Alternatively, we could connect θ0 with more than two modes and build the complex K(S(w0,θ0), . . . , S(w4,θ0)), connecting 5 modes through a single point, similar to the m-tunnels presented in Fort & Jastrzebski (2019). SPRO can be taken further, however, and we could train (one at a time) a sequence of θj's to find the complex K(S(w0,θ0,θ1,θ2), S(w1,θ0,θ1,θ2), S(w2,θ0,θ1,θ2)), describing a multi-dimensional volume that simultaneously connects 3 modes through 3 shared points.

We aim to train the θj's in K such that the expected loss for models in the simplicial complex is low and the volume of the simplicial complex is as large as possible. That is, as we train the j-th connecting point, θj, we wish to minimize E_{φ∼K} L(D, φ) while maximizing V(K), using φ ∼ K to indicate that φ follows a uniform distribution over the simplicial complex K.

Following Garipov et al. (2018), we use H parameter vectors randomly sampled from the simplex, φ1, . . . , φH ∼ K, to compute (1/H) Σ_{h=1}^{H} L(D, φh) as an estimate of E_{φ∼K} L(D, φ).² In practice we only need a small number of samples, H, and for all experiments use H = 5 to balance between avoiding significant slowdowns in the loss computation and ensuring we have reasonable estimates of the loss over the simplex. Using this estimate we train θj by minimizing the regularized loss,

    L_reg(K) = (1/H) Σ_{φh∼K} L(D, φh) − λj log(V(K)).    (1)

The regularization penalty λj balances the objective between seeking a smaller simplicial complex that contains strictly low loss parameter settings (small λj), and a larger complex that may contain less accurate solutions but encompasses more volume in parameter space (large λj). In general only a small amount of regularization is needed, and results are not sensitive to the choice of λj. In Section 5 we explain how to adapt Eq. 1 to train simplexes of low loss using single independently trained models. We provide details about how we choose λj in Appendix A.2.

²We discuss the exact method for sampling, and the implications on bias in the loss estimate, in Appendix A.1.
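To make Eq. (1) concrete, the sketch below shows one way a single SPRO objective evaluation could look in PyTorch. It is a minimal, hypothetical illustration rather than the released implementation: it assumes PyTorch 2.x for torch.func.functional_call, represents every vertex as a flat parameter vector, computes log-volume from the Gram matrix of edge vectors (mathematically equivalent to the Cayley-Menger determinants of Appendix A.1), and, for simplicity, picks a component simplex uniformly at random rather than in proportion to its volume (the paper's sampling details are in Appendix A.1). The function and argument names are our own.

```python
import math
import random
import torch
from torch.func import functional_call

def sample_simplex(vertices):
    """One point drawn uniformly from the simplex spanned by `vertices` (flat tensors).

    Uniform simplex weights follow a Dirichlet(1, ..., 1) distribution, obtained by
    normalizing i.i.d. Exp(1) draws (see Appendix A.1)."""
    y = -torch.rand(len(vertices)).log()          # Exp(1) samples
    w = y / y.sum()
    return sum(wi * vi for wi, vi in zip(w, vertices))

def log_volume(vertices):
    """log V of a simplex from the Gram matrix of its edge vectors:
    V = sqrt(det(E E^T)) / n!, equivalent to the Cayley-Menger form in Appendix A.1."""
    edges = torch.stack([v - vertices[0] for v in vertices[1:]])
    n = len(vertices) - 1
    return 0.5 * torch.logdet(edges @ edges.T) - math.lgamma(n + 1)

def unflatten(flat, model):
    """Split a flat parameter vector into a {name: tensor} dict for functional_call."""
    params, i = {}, 0
    for name, p in model.named_parameters():
        params[name] = flat[i:i + p.numel()].view_as(p)
        i += p.numel()
    return params

def spro_loss(model, loss_fn, batch, complex_simplexes, lam=1e-8, n_samples=5):
    """Monte Carlo estimate of Eq. (1): E_{phi ~ K}[L(D, phi)] - lam * log V(K).

    `complex_simplexes` is a list of simplexes; each simplex is a list of flat
    parameter vectors (fixed modes detached, the trainable theta_j carrying grad).
    Model, data, and vertices are assumed to live on the same device."""
    x, y = batch
    data_loss = 0.0
    for _ in range(n_samples):
        # Pick one component simplex, then a point uniformly inside it.
        phi = sample_simplex(random.choice(complex_simplexes))
        preds = functional_call(model, unflatten(phi, model), (x,))
        data_loss = data_loss + loss_fn(preds, y)
    data_loss = data_loss / n_samples
    # V(K) is the sum of component volumes, so log V(K) = logsumexp of log-volumes.
    log_vol = torch.logsumexp(
        torch.stack([log_volume(s) for s in complex_simplexes]), 0)
    return data_loss - lam * log_vol
```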

4. Volume Finding Experiments

In this section, we find volumes of low loss in a variety of settings. First, we show that the mode finding procedure of Garipov et al. (2018) can be extended to find distributions of modes. Then, we explore mode connecting simplicial complexes of low loss in a variety of settings, and finally we provide an empirical upper bound on the dimensionality of the mode connecting spaces.

Loss Surface Plots. Throughout this section and the remainder of the paper we display two-dimensional visualizations of loss surfaces of neural networks. These plots represent the loss within the plane defined by the three points (representing parameter vectors) in each plot. More specifically, if the three points in question are, e.g., w0, w1, and w2, then we define c = (1/3) Σ_{i=0}^{2} wi as the center of the points and use Gram-Schmidt to construct u and v, an orthonormal basis for the plane defined by the points. With the center and the basis chosen, we can sample the loss at parameter vectors of the form w = c + ru u + rv v, where ru and rv range from −R to R, a range parameter chosen such that all the points are within the surface with a reasonable margin.
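A minimal sketch of this plotting procedure, assuming the three points are given as flat parameter vectors (e.g., obtained via torch.nn.utils.parameters_to_vector); the function name, grid resolution, and range are our own choices, and for batch-norm networks the running statistics would additionally need to be refreshed for each sampled parameter vector (Appendix A.2).

```python
import torch
from torch.nn.utils import vector_to_parameters

@torch.no_grad()
def loss_plane(model, loss_fn, loader, points, R=1.0, steps=21, device="cpu"):
    """Evaluate the loss on the plane through three flat parameter vectors.

    `points` = [w0, w1, w2]; the plane is parameterized around their mean with an
    orthonormal basis (u, v) obtained by Gram-Schmidt, as described above."""
    model.eval()
    w0, w1, w2 = [p.to(device) for p in points]
    c = (w0 + w1 + w2) / 3.0                      # center of the three points
    u = w1 - w0
    u = u / u.norm()
    v = w2 - w0
    v = v - (v @ u) * u                           # remove the component along u
    v = v / v.norm()
    grid = torch.linspace(-R, R, steps)
    surface = torch.zeros(steps, steps)
    for i, ru in enumerate(grid):
        for j, rv in enumerate(grid):
            # Load the parameter vector c + ru * u + rv * v into the network.
            vector_to_parameters(c + ru * u + rv * v, model.parameters())
            total, n = 0.0, 0
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                total += loss_fn(model(x), y).item() * x.size(0)
                n += x.size(0)
            surface[i, j] = total / n
    return grid, surface
```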
4.1. Volumes of Connecting Modes

In Bayesian deep learning, we wish to form a predictive distribution through a posterior weighted Bayesian model average:

    p(y | x, D) = ∫ p(y | w, x) p(w | D) dw,    (2)

where y is an output (e.g., a class label), x is an input (e.g., an image), D is the data, and w are the neural network weights. This integral is challenging to compute due to the complex structure of the posterior p(w | D).

To help address this challenge, we can instead approximate the Bayesian model average in a subspace that contains many good solutions, as in Izmailov et al. (2019). Here, we generalize the mode connecting procedure of Garipov et al. (2018) to perform inference over subspaces that contain volumes of mode connecting curves.

In Garipov et al. (2018), a mode connecting curve is defined by its parameters θ. Treating the objective used to find θ in Garipov et al. (2018), l(θ), as a likelihood, we infer an approximate Gaussian posterior q(θ|D) using the SWAG procedure of Maddox et al. (2019), which induces a distribution over mode connecting curves. Each sample from q(θ|D) provides a mode connecting curve, which itself contains a space of complementary solutions.

In Figure 2, we see that it is possible to move between different values of θ without leaving a region of low loss. We show samples from the SWAG posterior, projected into the plane formed by the endpoints of the curves, w0 and w1, and a mode connecting point θ0. We show the induced connecting paths from SWAG samples with orange lines. All samples from the SWAG posterior lie in the region of low loss, as do the sampled connecting paths, indicating that there is indeed an entire volume of connected low loss solutions induced by the SWAG posterior over θ. We provide training details in Appendix A.4.

Figure 2. A loss surface in the basis spanned by the defining points of a connecting curve, w0, w1, θ0. Using SWAG, we form a posterior distribution over mode connecting curves, representing a volume of low loss explanations for the data.

Figure 3. Loss surfaces for planes intersecting a mode connecting simplicial complex K(S(w0,θ0,θ1,θ2), S(w1,θ0,θ1,θ2)) trained on CIFAR-10 using a VGG16 network. Top: along any w0 → θj → w1 path we recover a standard mode connecting path. Bottom Left: a face of one of the simplexes that contains one of the independently trained modes. We see that as we travel away from w1 along any path within the simplex we retain low train loss. Bottom Right: the simplex defined by the three mode connecting points. Any point sampled from within this simplex defines a low-loss mode connecting path between w0 and w1.

4.2. Simplicial Complex Mode Connectivity

The results of Section 4.1 suggest that modes might be connected by multi-dimensional paths. SPRO represents a natural generalization of the idea of learning a distribution over connecting paths. By construction, if we use SPRO to find the simplicial complex K(S(w0,θ0,...,θk), . . . , S(wm,θ0,...,θk)) we have found a whole space of suitable vertices to connect the modes w0, . . . , wm. Any θ sampled from the k-simplex S(θ0,...,θk) will induce a low-loss connecting path between any two vertices in the complex.

To demonstrate that SPRO finds volumes of low loss, we trained a simplicial complex using SPRO, K(S(w0,θ0,θ1,θ2), S(w1,θ0,θ1,θ2)), forming two simplexes containing three connecting vertices θ0, θ1, θ2 between the two fixed points, w0 and w1, which are pre-trained models. Figure 3 shows loss surface visualizations of this simplicial complex in the parameter space of a VGG16 network trained on CIFAR-10. We see that this complex contains not only standard mode connecting paths, but also volumes of low loss that connect modes. Figure 3 is a straightforward representation of how the loss landscape of large neural networks should be understood as suggested in Figure 1; not only are all training solutions connected by paths of low loss, they are points on the same multi-dimensional manifold of low loss. In the bottom right panel of Figure 3, every point in the simplex corresponds to a different mode connecting curve.

In Figure 4, we show there exist manifolds of low loss that are vastly more intricate and high dimensional than a simple composition of 3-simplexes connecting two modes. In Figure 4a, we connect 4 modes using 3 connecting points so that we have four different simplexes formed between the modes of low loss for VGG16 networks (Simonyan & Zisserman, 2014) on CIFAR-100. The structure becomes considerably more intricate as we expand the number of modes used; Figure 4b uses 7 modes with 9 connecting points, forming 12 inter-connected simplexes. Note that in this case not all modes are in shared simplexes with all connecting points. These results clearly demonstrate that SPRO is capable of finding intricate and multi-dimensional structure within the loss surface. As a broader takeaway, any mode we find through standard training is a single point within a large and high dimensional structure of low loss, as shown in the rightmost representation in Figure 1. We consider the accuracy of ensembles found via these mode connecting simplexes in Appendix B.3. In Section 5.4 we consider a particularly practical approach to ensembling with SPRO.

Figure 4. (a, b) Three dimensional projections of mode connecting simplicial complexes with training modes shown in blue and connectors in orange. Blue shaded regions represent regions of low loss found via SPRO. (a) 4 modes and 3 connecting points found with a VGG16 network on CIFAR-100. (b) 7 modes and a total of 9 connecting points found with a VGG16 network on CIFAR-10.

Figure 6. Loss surface visualizations of the faces of a sample ESPRO 3-simplex for a VGG network trained on CIFAR-100. The ability to find a low-loss simplex starting from only a single SGD solution, w0, leads to an efficient ensembling procedure.

Figure 5. Volume of the simplicial complex as a function of the number of connectors for a VGG net on CIFAR-10 for two settings of the SPRO regularization λ. After 10 connectors, the volume collapses, indicating that new points added to the simplicial complex are within the span of previously found vertices. The low-loss manifold must be at least 10 dimensional in this instance.

4.3. Dimensionality of Loss Valleys

We can estimate the highest dimensionality of the connecting space that SPRO can find, which provides a lower bound on the true dimensionality of these mode connecting subspaces for a given architecture and dataset. To measure dimensionality, we take two pre-trained modes, w0 and w1, and construct a connecting simplex with as many connecting points as possible, by finding the largest k such that K(S(w0,θ0,...,θk), S(w1,θ0,...,θk)) contains both low loss parameter settings and has non-zero volume. We could continue adding more degenerate points to the simplex; however, the resulting simplicial complex has no volume.

Figure 5 shows the volume of a simplicial complex connecting two modes as a function of the number of connecting points, k, for a VGG16 network on CIFAR-10. To ensure these are indeed low-loss complexes, we sample 25 models from each of these simplicial complexes and find that all sampled models achieve greater than 98% accuracy on the train set. We can continue adding new modes until we reach k = 11, when the volume collapses to approximately 10⁻⁴, from a maximum of 10⁵. Thus the manifold of low loss solutions for this architecture and dataset is at least 10 dimensional, as adding an eleventh point collapses the volume.

5. ESPRO: Ensembling with SPRO

The ability to find large regions of low loss solutions has significant practical implications: we show how to use SPRO to efficiently create ensembles of models either within a single simplex or by connecting an entire simplicial complex. We start by generalizing the methodology presented in Section 3.2, leading to a simplex based ensembling procedure we call ESPRO (Ensembling SPRO). Crucially, our approach finds a low-loss simplex starting from only a single SGD solution. We show that the different parameters in these simplexes give rise to a diverse set of functions, which is crucial for ensembling performance. Finally, we demonstrate that ESPRO outperforms state-of-the-art deep ensembles (Lakshminarayanan et al., 2017), both as a function of ensemble components and total computational budget. In Section 6, we show ESPRO also provides state-of-the-art results for uncertainty representation.

5.1. Finding Simplexes from a Single Mode

In Section 3.2 we were concerned with finding a simplicial complex that connects multiple modes. We now describe how to adapt SPRO into a practical approach to ensembling by instead finding multiple simplexes of low loss, each — crucially — starting from a single pre-trained SGD solution.

Simplexes contain a single mode, and take the form S(wj, θj,0, . . . , θj,k), where θj,k is the k-th vertex found with SPRO in a simplex where one of the vertices is mode wj. We find SPRO simplexes one at a time, rather than as a complex. The associated loss function to find the k-th vertex in association with mode wj is

    L_reg(D, S(wj,θj,0,...,θj,k)) = (1/H) Σ_{φh∼S} L(D, φh) − λk log(V(S(wj,θj,0,...,θj,k))).    (3)

For compactness we write φh ∼ S to indicate that φh is sampled uniformly at random from the simplex S(wj,θj,0,...,θj,k).

Figure 7. Functional diversity within a simplex. We show the decision boundaries for two classes in the two spirals problem, with predictions in yellow and purple respectively. Both plots are independent solution samples drawn from a 3-simplex of an 8-layer feed forward classifier and demonstrate that the simplexes have considerable functional diversity, as illustrated by different decision boundaries. Significant differences are visible inside the data distribution (center of plots) and outside (around the edges).

We can think of this training procedure as extending out from the pre-trained mode wj. First, in finding θj,0 we find a line segment of low loss solutions, where one end of the line is wj. Next, with θj,0 fixed, we seek θj,1 such that the triangle formed by wj, θj,0, and θj,1 contains low loss solutions. We can continue adding vertices, constructing many dimensional simplexes.
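Under the same assumptions as the earlier sketch, and reusing its hypothetical spro_loss helper (Eq. (3) is simply Eq. (1) restricted to a single simplex), the vertex-by-vertex growth of a single-mode simplex might look as follows. The per-vertex epoch count and learning rate mirror values reported in Appendix A.3, but the loop itself is only an illustration.

```python
import torch

def grow_simplex(model, loss_fn, loader, w_pretrained, n_vertices=3,
                 epochs_per_vertex=10, lr=0.01, lam=1e-8):
    """Grow a low-loss simplex outward from a single pre-trained solution.

    Relies on the hypothetical spro_loss helper sketched in Section 3.2;
    model, data, and vertices are assumed to share a device."""
    vertices = [w_pretrained.detach()]            # the fixed SGD mode w_j
    for _ in range(n_vertices):
        # Initialize the new vertex at the mean of the existing ones (Appendix A.2).
        theta = torch.stack(vertices).mean(0).clone().requires_grad_(True)
        opt = torch.optim.SGD([theta], lr=lr, momentum=0.9)
        for _ in range(epochs_per_vertex):
            for batch in loader:
                opt.zero_grad()
                loss = spro_loss(model, loss_fn, batch, [vertices + [theta]], lam=lam)
                loss.backward()
                opt.step()
        vertices.append(theta.detach())
    return vertices
```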

With the resulting simplex S(wj, θj,0, . . . , θj,k), we can sample as many models from within the simplex as we need, and use them to form an ensemble. Functionally, ensembles sampled from SPRO form an approximation to Bayesian marginalization over the model parameters, where we assume a posterior that is uniform over the simplex. We can define our prediction for a given input x as

    ŷ = (1/M) Σ_{φm∼S} f(x, φm) ≈ ∫ f(x, φ) dφ,  φ ∼ S,    (4)

where we write S as shorthand for S(wj, θj,0, . . . , θj,k). Specifically, the Bayesian model average and its approximation using approximate posteriors is

    p(y* | y, M) = ∫ p(y* | φ) p(φ | y) dφ ≈ ∫ p(y* | φ) q(φ | y) dφ ≈ (1/M) Σ_{i=1}^{M} p(y* | φi),  φi ∼ q(φ | y).

5.2. ESPRO: Ensembling over Multiple Independent Simplexes

We can significantly improve performance by ensembling from a simplicial complex containing multiple disjoint simplexes, which we refer to as ESPRO (Ensembling over SPRO simplexes). To form such an ensemble, we take a collection of j parameter vectors from independently trained models, w0, . . . , wj, and train a (k + 1)-order simplex at each one using ESPRO. This procedure defines the simplicial complex K(S(w0,...,θ0,k), . . . , S(wj,...,θj,k)), which is composed of j disjoint simplexes in parameter space. Predictions with ESPRO are generated as

    ŷ = (1/J) Σ_{φj∼K} f(x, φj) ≈ ∫_K f(x, φ) dφ,    (5)

where K is shorthand for K(S(w0,...,θ0,k), . . . , S(wj,...,θj,k)). ESPRO can be considered a mixture of simplexes (e.g. a simplicial complex) to approximate a multimodal posterior, towards a more accurate Bayesian model average. This observation is similar to how Wilson & Izmailov (2020) show that deep ensembles provide a compelling approximation to a Bayesian model average (BMA), and improve upon deep ensembles through the MultiSWAG procedure, which uses a mixture of Gaussians approximation to the posterior. ESPRO further improves the approximation to the BMA by covering a larger region of the posterior corresponding to low loss solutions with functional variability. This perspective helps explain why ESPRO improves both accuracy and calibration, through a richer representation of epistemic uncertainty.

We verify the ability of ESPRO to find a simplex of low loss starting from a single mode in Figure 6, which shows the loss surface in the planes defined by the faces of a 3-simplex found in the parameter space of a VGG16 network trained on CIFAR-100. The ability to find these simplexes is core to forming ESPRO ensembles, as they only take a small number of epochs to find, typically less than 10% of the cost of training a model from scratch, and they contain diverse solutions that can be ensembled to improve model performance. Notably, we can sweep out a volume of low loss in parameter space without needing to first find multiple modes, in contrast to prior work on mode connectivity (Draxler et al., 2018; Garipov et al., 2018; Fort & Jastrzebski, 2019). We show additional results with image transformers (Dosovitskiy et al., 2020) on CIFAR-100 in Appendix B.2.
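As a rough illustration of Eqs. (4) and (5), the sketch below averages class probabilities over models drawn uniformly from each simplex in the complex. The helpers sample_simplex and unflatten are the hypothetical ones from the Section 3.2 sketch, and the default of 25 samples per simplex follows Appendix B.1; this is our own illustration, not the released code.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

@torch.no_grad()
def espro_predict(model, x, simplexes, samples_per_simplex=25):
    """Approximate the ESPRO model average of Eqs. (4)-(5) by averaging class
    probabilities over models sampled uniformly from each low-loss simplex.

    `simplexes` is a list of simplexes, each a list of flat parameter vectors.
    For batch-norm networks, the running statistics of each sampled model would
    also need to be refreshed (Appendix A.2)."""
    model.eval()
    probs = []
    for vertices in simplexes:
        for _ in range(samples_per_simplex):
            phi = sample_simplex(vertices)
            logits = functional_call(model, unflatten(phi, model), (x,))
            probs.append(F.softmax(logits, dim=-1))
    return torch.stack(probs).mean(0)     # ensemble predictive distribution
```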
In Figure9, we show the results of ESPRO (top left) ensemble components on CIFAR-10, we can form a deep which recovers good qualitative uncertainty bands on this ensemble to achieve an error rate of approximately 6.2%; task. We compare to deep ensembles (size 5) (top right) however, by extending each base model to just a simple and the state of the art subspace inference method of Iz- 2-simplex (3 vertices) we can achieve an error rate of ap- mailov et al.(2019) (bottom left), finding that ESPRO does proximately 5.7% — an improvement of nearly 10%! a better job of recovering uncertainty about the latent func- After finding a mode through standard training, a low order tion f than either competing method, as shown by the 2σ simplex can be found in just a small fraction of the time confidence region about p(f|D). Indeed, even taking into it takes to train a model from scratch. For a fixed training account the true noise, ESPRO complexes also do a bet- budget, we find that we can achieve a much lower error rate ter job of modelling the noisy responses, y, measured by through training fewer overall ensemble components, but p(y|D) than either approach. training low order simplexes (order 0 to 2) at each mode us- ing ESPRO. Figure8 shows a comparison of test error rate 6.2. NLL, Calibration, and Accuracy under Dataset for ensembles of VGG16 models over different numbers Shift of ensemble components and simplex sizes on CIFAR-10 and CIFAR-100. For any fixed ensemble size, we can gain Modern neural networks are well known to be poorly cali- performance by using a ESPRO ensemble rather than a stan- brated and to result in overconfident predictions. Following dard deep ensemble. Furthermore, training these ESPRO Ovadia et al.(2019), we consider classification accuracy, the negative log likelihood (NLL), and expected calibra- 3We show the relationship between samples from the simplex tion error (ECE), to asses model performance under varying and test error in Appendix B.1. amounts of dataset shift, comparing to deep ensembles (Lak- Loss Surface Simplexes

Figure 8. Performance of deep ensembles and ESPRO (with either a 1-simplex, e.g. a line, or a 2-simplex, e.g. a triangle) in terms of total train time and the number of simplexes (number of ensembles). Left: Test error as a function of total training budget on CIFAR-10. The number of components in the ensembles increases as curves move left to right. For any given training budget, ESPRO outperforms deep ensembles. Center and Right: Test error as a function of the number of simplexes in the ensemble, comparing ESPRO models on CIFAR-10 (center) and CIFAR-100 (right) for VGG16 networks, with various numbers of ensemble components along the x-axis and various simplex orders indicated by color. For any fixed number of ensemble components we can outperform a standard deep ensemble using simplexes from ESPRO. Notably, expanding the number of vertices in a simplex takes only 10 epochs of training on CIFAR-10 compared to the 200 epochs of training required to train a model from scratch. On CIFAR-100 adding a vertex to an ESPRO simplex takes just 20 epochs of training compared to 300 to train from scratch.

In Figure 10a, we show results across all levels for the Gaussian noise corruption, where we see that ESPRO is most accurate across all levels. For NLL we use temperature scaling (Guo et al., 2017) on all methods to reduce the overconfidence and report the results in Figure 10b. We see that ESPRO with temperature scaling outperforms all other methods for all corruption levels. We show ECE and results across other types of dataset corruption in Appendix C.1.

Figure 9. Qualitative uncertainty plots of p(f|D) on a regression problem. We show both the 2σ confidence regions from p(f|D) (the latent noise-free function) and p(y|D), which includes the observed noise of the data (aleatoric uncertainty). Top Left: ESPRO; colored lines are the vertices in the simplex, the first two being fixed points in the simplex. Top Right: Deep ensembles; colored lines are individual models. Bottom Left: Curve subspaces. ESPRO solutions produce functionally diverse solutions that have good in-between (between the data distribution) and extrapolation (outside of the data distribution) uncertainties; the ESPRO predictive distribution is broader and more realistic than deep ensembles and mode-connecting subspace inference, by containing a greater variety of high performing solutions.

7. Discussion

We have shown that the loss landscapes for deep neural networks contain large multi-dimensional simplexes of low loss solutions. We proposed a simple approach, which we term SPRO, to discover these simplexes. We show how this geometric discovery can be leveraged to develop a highly practical approach to ensembling, which works by sampling diverse and low loss solutions from the simplexes. Our approach improves upon state-of-the-art methods including deep ensembles and MultiSWAG, in accuracy and robustness.

Overall, this paper provides a new understanding of how the loss landscapes in deep learning are structured: rather than isolated modes, or basins of attraction connected by thin tunnels, there are large multidimensional manifolds of connected solutions.

This new understanding of neural network loss landscapes has many exciting practical implications and future directions. We have shown we can build state-of-the-art ensembling approaches from low loss simplexes. In the future, one could build posterior approximations that cover these simplexes, but also extend coverage to lower density points for a more exhaustive Bayesian model average. We could also imagine creating models that are significantly sparser than standard neural networks, with functional variability defined by the simplicial complexes we have discovered here. We could additionally build stochastic MCMC methods designed to navigate specifically in these subspaces of low loss but diverse solutions. These types of topological features in the loss landscape, which are very distinctive to neural networks, hold the keys to understanding generalization in deep learning.

Figure 10. (a) Accuracy for Gaussian noise corruption for MultiSWA, MultiSWAG, deep ensembles, and ESPRO, across corruption levels 1-5. (b) NLL under the same corruption. All models were originally significantly over-confident, so we use temperature scaling (Guo et al., 2017) to improve uncertainty; after temperature scaling ESPRO generally performs the best under varying levels of corruption.

References

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192-204. PMLR, 2015.

Colins, K. D. Cayley-Menger determinant. From MathWorld--A Wolfram Web Resource, created by Eric W. Weisstein, https://mathworld.wolfram.com/Cayley-MengerDeterminant.html.

Czarnecki, W. M., Osindero, S., Pascanu, R., and Jaderberg, M. A deep neural network's loss surface contains every low-dimensional pattern. arXiv preprint arXiv:1912.07559, 2019.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. Essentially no barriers in neural network energy landscape. In International Conference on Machine Learning, pp. 1309-1318, 2018.

Fort, S. and Jastrzebski, S. Large scale structure of neural network loss landscapes. In Advances in Neural Information Processing Systems, volume 32, pp. 6709-6717, 2019.

Freeman, C. D. and Bruna, J. Topology and geometry of half-rectified network optimization. In International Conference on Learning Representations, 2017. URL arXiv:1611.01540.

Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. Loss surfaces, mode connectivity, and fast ensembling of DNNs. Advances in Neural Information Processing Systems, 31:8789-8798, 2018.

Goodfellow, I. J., Vinyals, O., and Saxe, A. M. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, volume 70, pp. 1321-1330. PMLR, 2017.

Hochreiter, S. and Schmidhuber, J. Flat minima. Neural Computation, 9(1):1-42, 1997.

Huang, W. R., Emam, Z., Goldblum, M., Fowl, L., Terry, J. K., Huang, F., and Goldstein, T. Understanding generalization through visualizations. arXiv preprint arXiv:1906.03291, 2019.

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. Averaging weights leads to wider optima and better generalization. In Uncertainty in Artificial Intelligence, 2018. URL arXiv:1803.05407.

Izmailov, P., Maddox, W. J., Kirichenko, P., Garipov, T., Vetrov, D., and Wilson, A. G. Subspace inference for Bayesian deep learning. In Uncertainty in Artificial Intelligence, pp. 1169-1179. PMLR, 2019.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017. URL arXiv:1609.04836.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30:6402-6413, 2017.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pp. 6389-6399, 2018.

Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems, 32:13153-13164, 2019.

Maddox, W. J., Benton, G., and Wilson, A. G. Rethinking parameter counting in deep models: Effective dimensionality revisited. arXiv preprint arXiv:2003.02139, 2020.

Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., and Snoek, J. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, volume 32, pp. 13991-14002, 2019.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Skorokhodov, I. and Burtsev, M. Loss landscape sightseeing with multi-point optimization. arXiv preprint arXiv:1910.03867, 2019.

Wilson, A. G. and Izmailov, P. Bayesian deep learning and a probabilistic perspective of generalization. In Advances in Neural Information Processing Systems, volume 33, 2020.

Appendix for Loss Surface Simplexes for Mode Connecting Volumes and Fast Ensembling

Outline

The Appendix is outlined as follows: In Appendix A, we give a more detailed description of our methods, focusing first on computing the simplex volume and sampling from the simplexes; we then describe vertex initialization and regularization, give training details, and finally describe the training procedure for multi-dimensional mode connectors. In Appendix B, we describe several more results on volume and ensembling, particularly on the number of samples required for good performance with SPRO and ESPRO. Finally, in Appendix C, we plot the results of a larger suite of corruptions on CIFAR-10 for ESPRO, deep ensembles, and MultiSWAG.


Figure A.1. A simplified version of the progressive understanding of the loss landscape of neural networks. Left: The traditional view in which low loss modes are disconnected in parameter space. Center: The updated understanding provided by works such as Draxler et al. (2018), Fort & Jastrzebski(2019), and Garipov et al.(2018), in which modes are connected along thin paths or tunnels. Right: The view we present in this work: independently trained models converge to points on the same volume of low loss.

A. Extended Methodology

First, we present a two dimensional version of the schematic in Figure 1 in Figure A.1, which shows the same progressive illustration, but in two dimensions.

A.1. Simplex Volume and Sampling

We employ simplexes in the loss surface primarily for two reasons:

• sampling uniformly from within a simplex is straightforward, meaning we can estimate the expected loss within any found simplexes easily,

• computing the volume of a simplex is efficient, allowing for regularization that encourages high-volume simplexes.


Figure A.2. Left: 100 samples drawn uniformly from within the unit simplex. Right: 100 samples drawn from a non-unit simplex (note the scale of the X1 axis). The distribution of points in both simplexes is visually indistinguishable — evidence that the method for sampling from a unit simplex is sufficient to draw samples from arbitrary simplexes.

Sampling from Simplexes: Sampling from the standard simplex is just a specific case of sampling from a Dirichlet distribution with concentration parameters all equal to 1. The standard n-simplex is the simplex formed by the vectors v0, . . . , vn such that the vi's are the standard unit vectors. Therefore, to draw samples from an n-simplex in a d dimensional space with vertices v0, . . . , vn, we follow the same procedure used to sample from a Dirichlet distribution.

To sample a vector x = [x0, . . . , xd]^T we first draw y0, . . . , yn i.i.d. ∼ Exp(1), then set ỹi = yi / Σj yj. Finally, x = Σi ỹi vi.

While this method is sufficient for simulating vectors uniformly at random from the standard simplex, there is no guarantee that such a sampling method produces uniform samples from an arbitrary simplex, and thus samples of the loss over the simplex that we use in Equation 1 may not be an unbiased estimate of the expected loss over the simplex. Practically, we do not find this to be an issue, and are still able to recover low loss simplexes with this approach. Furthermore, Figure A.2 shows that the distribution of samples in a unit simplex is visually similar to the samples from an elongated simplex where we multiply one of the basis vectors by a factor of 100. This figure serves to show that although there may be some bias in our estimate of the loss over the simplex in Equation 1, it should not be (and is not in practice) limiting to our optimization routine. Note too, this may appear like a simplistic case, but typically the simplexes found by SPRO contain only a small number of vertices, so a 2-simplex whose side lengths vary by a factor of nearly 100 is a reasonable comparison to a scenario we may find in practice.
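A small NumPy sketch of this sampling procedure (our own illustration, not the paper's code); the closing comment notes that the normalized Exp(1) draws coincide with NumPy's flat Dirichlet sampler.

```python
import numpy as np

def sample_simplex_np(vertices, n_samples=1, rng=None):
    """Sample points from the simplex whose corners are the rows of `vertices`.

    Implements the procedure above: draw i.i.d. Exp(1) variables, normalize them
    to Dirichlet(1, ..., 1) weights, then take a convex combination of the vertices."""
    rng = np.random.default_rng(rng)
    vertices = np.asarray(vertices)                     # shape (n + 1, d)
    y = rng.exponential(scale=1.0, size=(n_samples, len(vertices)))
    weights = y / y.sum(axis=1, keepdims=True)          # Dirichlet(1, ..., 1)
    return weights @ vertices                           # shape (n_samples, d)

# The weights have the same distribution as
# rng.dirichlet(np.ones(len(vertices)), size=n_samples).
```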

Computing Simplex Volume: Simplex volumes can be easily computed using Cayley-Menger determinants (Colins). If we have an n-simplex defined by the parameter vectors w0, . . . , wn, with dij = ‖wi − wj‖ the distance between vertices, the Cayley-Menger determinant is defined as

    CM(w0, . . . , wn) = det | 0      d01²   ···   d0n²   1 |
                             | d01²   0      ···   d1n²   1 |
                             | ⋮      ⋮      ⋱     ⋮      ⋮ |
                             | dn0²   dn1²   ···   0      1 |
                             | 1      1      ···   1      0 |    (A.1)

The volume of the simplex S(w0,...,wn) is then given as

    V(S(w0,...,wn))² = ((−1)^(n+1) / ((n!)² 2ⁿ)) CM(w0, . . . , wn).    (A.2)

While in general we may be averse to computing determinants or factorial terms, the simplexes we work with in this paper are generally low order (all are under 10 vertices total), meaning that computing the Cayley-Menger determinants is generally a quite fast and stable computation.
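A direct NumPy translation of Eqs. (A.1) and (A.2), included as an illustrative sketch; the bordered matrix is built with the unit border last rather than first, which leaves the determinant unchanged.

```python
import math
import numpy as np

def simplex_volume(vertices):
    """Volume of the simplex whose corners are the rows of `vertices`, via (A.1)-(A.2).

    Builds the bordered matrix of squared pairwise distances, takes its determinant
    (the Cayley-Menger determinant), and applies the (A.2) normalization."""
    w = np.asarray(vertices, dtype=float)                          # shape (n + 1, d)
    n = len(w) - 1
    d2 = np.sum((w[:, None, :] - w[None, :, :]) ** 2, axis=-1)     # squared distances
    cm = np.ones((n + 2, n + 2))
    cm[:n + 1, :n + 1] = d2
    cm[-1, -1] = 0.0
    det = np.linalg.det(cm)
    vol_sq = (-1) ** (n + 1) / (math.factorial(n) ** 2 * 2 ** n) * det
    return math.sqrt(max(vol_sq, 0.0))

# Sanity check: a right triangle with unit legs has area 1/2.
print(simplex_volume([[0, 0], [1, 0], [0, 1]]))                    # ~0.5
```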

Figure A.3. CIFAR-10 test accuracy as a function of the regularization parameter λ*, colored by the number of simplex vertices (2-5). Accuracy is essentially unchanged for the various regularization parameters.

A.2. Initialization and Regularization

Vertex Initialization: We initialize the j-th parameter vector corresponding to a vertex in the simplex as the mean of the previously found vertices, wj = (1/j) Σ_{i=0}^{j−1} wi, and train using the regularized loss in Eq. 1.

Regularization Parameter: As the order of the simplex increases, the volume of the simplex increases exponentially. Thus, we define a distinct regularization parameter, λk, in training each θk to provide consistent regularization for all vertices. To choose the λk's we define a λ* and compute

    λk = λ* / log V(K),    (A.3)

where K is a randomly initialized simplicial complex of the same structure that the simplicial complex will have while training θk. Eq. A.3 normalizes the λk's such that they are similar when accounting for the exponential growth in volume as the order of the simplex grows. In practice we need only small amounts of regularization, and choose λ* = 10⁻⁸. Since we are spanning a space of near constant loss, any level of regularization will encourage finding simplexes with non-trivial volume.

Finally, when dealing with models that use batch normalization, we follow the procedure of Garipov et al. (2018) and compute several forward passes on the training data for a given sample from the simplex to update the batch normalization statistics. For layer normalization, we do not need to use this procedure as layer norm updates at test time.
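A compact sketch of these two choices, assuming the hypothetical log_volume helper from the Section 3.2 sketch; the function names are ours.

```python
import torch

def init_vertex(previous_vertices):
    """Appendix A.2: start the new vertex at the mean of the vertices found so far."""
    return torch.stack(previous_vertices).mean(0).clone().requires_grad_(True)

def lambda_k(lam_star, template_simplexes):
    """Eq. (A.3): scale lambda* by log V(K) of a randomly initialized complex with
    the same structure, so regularization stays comparable as the order grows."""
    log_vol = torch.logsumexp(
        torch.stack([log_volume(s) for s in template_simplexes]), 0)
    return lam_star / log_vol
```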

A.3. Training Details

Throughout, we used VGG-16 like networks originally introduced in Simonyan & Zisserman (2014) from https://github.com/pytorch/vision/blob/master/torchvision/models/vgg.py. For training, we used standard normalization, random horizontal flips and crops, and a batch size of 128. We used SGD with momentum 0.9, a cosine annealing learning rate with a single cycle, a learning rate of 0.05, and weight decay 5e-4, training for 300 epochs for the pre-trained VGG models. For SPRO, we used a learning rate of 0.01 and trained for 20 epochs for each connector.

In our experiments with transformers, we used the ViT-B 16 image transformer model (Dosovitskiy et al., 2020) pre-trained on ImageNet from https://github.com/jeonsworld/ViT-pytorch and trained on CIFAR-100 with an upsampled image size of 224 and a batch size of 512 for 50000 steps (the default fine-tuning on CIFAR-100). Again, we used random flips and crops for data augmentation.


Figure A.4. Test error vs. number of samples, J, in the ensemble on CIFAR-100 using a VGG16 network and a 3-simplex trained with SPRO. For any number of components in the SPRO ensemble greater than approximately 25 we achieve near constant test error.

To train these SPRO models, we used a learning rate of 0.001 and trained with SGD for 30 epochs for each connector, using 20 samples from the simplex at test time.

A.4. Multi-Dimensional Mode Connectors

To train the multi-dimensional SWAG connectors, we connected two pre-trained networks following Garipov et al. (2018) using a piece-wise linear curve, trained for 75 epochs with an initial learning rate of 0.01, decaying the learning rate to 1e-4 by epoch 40. At epoch 40, we reset the learning rate to be constant at 5e-3. The final individual sample accuracy (not SWA) was 91.76%, which is similar to the final individual sample accuracies for standard training of VGG networks with SWAG. We used random crops and flips for data augmentation.

B. Extended Volume and Ensembling Results

B.1. Test Error vs. Simplex Samples

SPRO gives us access to a whole space of model parameters to sample from, rather than just a discrete number of models to use as in deep ensembles. Therefore a natural question to ask is how many models and forward passes need to be sampled from the simplex to achieve the highest accuracy possible without incurring too high of a cost at test time. Figure A.4 shows that for a VGG16 network trained on CIFAR-100 we achieve near constant accuracy for any number of ensemble components greater than approximately 25. Therefore, for the ensembling experiments in Section 5.4 we use 25 samples from each simplex to generate the SPRO ensembles. In this work we are not focused on the issue of test time compute cost, and if that were a consideration for deployment of a SPRO model we could evaluate the trade-off in terms of test time compute vs accuracy, or employ more sophisticated methods such as ensemble distillation.
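A sketch of the evaluation behind Figure A.4, reusing the hypothetical espro_predict helper from the Section 5.2 sketch; note that this version resamples models per batch, whereas fixing one set of sampled models and reusing it across the test set would match the paper's protocol more closely.

```python
import torch

@torch.no_grad()
def error_vs_samples(model, loader, simplex, sample_counts=(1, 5, 10, 25, 50, 100)):
    """Test error of a single-simplex ensemble as the number of sampled models grows."""
    results = {}
    for n in sample_counts:
        correct, total = 0, 0
        for x, y in loader:
            probs = espro_predict(model, x, [simplex], samples_per_simplex=n)
            correct += (probs.argmax(-1) == y).sum().item()
            total += y.numel()
        results[n] = 1.0 - correct / total        # test error for n samples
    return results
```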

B.2. Loss Surfaces of Transformers

Next, we show the results of training a SPRO 3-simplex with an image transformer on CIFAR-100 (Dosovitskiy et al., 2020) in Figure A.5. Due to computational requirements, the transformer was pre-trained on ImageNet before being fine-tuned on CIFAR-100 for 50,000 minibatches. We then trained each vertex for an additional 10 epochs. Due to the inflexibility of the architecture, the volume of the simplex found is much smaller, approximately $10^{-21}$. This resulted in some instability through training, possibly preventing much benefit from ensembling transformer models, as shown in Figure A.8. However, these results do demonstrate that a significant region of low loss can be found in subspaces of transformer models, and further work will be necessary to efficiently exploit these regions of low loss, much as has been done with convolutional networks and ResNets.

Figure A.5. Loss surface visualizations of the faces of a sample ESPRO 3-simplex for a Transformer architecture (Dosovitskiy et al., 2020) fine-tuned on CIFAR-100. Here, the volume is considerably smaller, but a low loss region is found.

Figure A.6. Test error for mode connecting simplexes that connect various numbers of modes through various numbers of connecting points in the parameter space of VGG16 networks trained on CIFAR-10 and CIFAR-100. The error rates of baseline models are shown as horizontal dotted lines. In general the highest performing models are those with the fewest modes and the fewest connecting points, but the performance gaps between configurations are small.


B.3. Ensembling Mode Connecting Simplexes

We can average predictions over these mode connecting volumes, generating ensemble predictions as $\hat{y} = \frac{1}{H} \sum_{\phi_h \sim \mathcal{K}} f(x, \phi_h)$, where $\phi_h \sim \mathcal{K}$ indicates that we sample models uniformly at random from the simplicial complex $\mathcal{K}(S(w_0, \theta_0, \dots), S(w_1, \theta_0, \dots), \dots)$. Test error for such ensembles, for volumes in the parameter space of VGG16-style networks on both CIFAR-10 and CIFAR-100, is given in Figure A.6. We see that while some improvements over the baselines can be made, mode connecting simplexes do not lead to highly performant ensembles.
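For concreteness, one way to implement the sampling step $\phi_h \sim \mathcal{K}$ is sketched below; this is our own illustration, and in particular choosing among simplexes uniformly rather than by volume is a simplification.

```python
import random
import torch

def sample_from_complex(simplexes):
    """Draw one flattened parameter vector from a simplicial complex.

    `simplexes` is a list of (k+1, d) tensors of vertex parameters; shared
    connecting vertices may appear in several of them. A simplex is picked
    uniformly at random (volume-weighted selection would give exactly uniform
    draws over the complex), then a uniform point inside it is taken via
    Dirichlet(1, ..., 1) barycentric coordinates.
    """
    vertices = random.choice(simplexes)
    coords = torch.distributions.Dirichlet(torch.ones(vertices.shape[0])).sample()
    return coords @ vertices   # convex combination of the vertices
```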

B.4. Ensembling Modes of SPRO

Figure A.7 presents the results of Figure 8 in the main text, but against the total training budget rather than the number of ensemble components. We see that on either dataset, for nearly any fixed training budget, the most accurate model is an ESPRO model, even when that means spending part of the budget on ESPRO simplexes and training fewer independent models overall. Times correspond to training models sequentially on an NVIDIA Titan RTX with 24GB of memory. Finally, Figure A.8 presents the results of ensembling with SPRO using state-of-the-art transformer architectures on CIFAR-100 (Dosovitskiy et al., 2020). We find, counterintuitively, that there is only a very small performance difference between ensembling with SPRO and the base architecture. We suspect that this is because it is currently quite difficult to train transformers without using significant amounts of unlabelled data.

Figure A.7. Test error of ESPRO models on CIFAR-10 (left) and CIFAR-100 (right) as a function of total training time (training the original models and the ESPRO simplexes). The color of the curves indicates the number of vertices in the simplex, with points corresponding to increasing numbers of ensemble components moving left to right (ranging from 1 to 8). We see that on either dataset, for nearly any fixed training budget, we are better off training fewer models overall and using ESPRO to construct simplexes to sample from.

Figure A.8. Test error and NLL as a function of the number of components in SPRO ensembles using image transformers on CIFAR-100. Compared to deep ensembles, performance is quite similar; ESPRO with four-dimensional simplexes achieves very slightly better test error, though slightly worse NLL.

C. Extended Uncertainty Results

C.1. Further NLL and Calibration Results

Finally, we include results across 18 different corruptions for the ensemble components. In order, these are jpeg compression, fog, snow, brightness, pixelate, zoom blur, saturate, contrast, motion blur, defocus blur, speckle noise, Gaussian blur, glass blur, shot noise, frost, spatter, impulse noise, and elastic transform. Each figure reports accuracy, NLL, and ECE for MultiSWA, MultiSWAG, deep ensembles, and ESPRO as a function of the number of models, at corruption intensities 1 through 5.
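Since the figures below report expected calibration error (ECE) alongside accuracy and NLL, we include a minimal reference implementation of the standard binned ECE; the bin count is our choice, as the paper does not state its binning scheme.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: weighted gap between mean confidence and accuracy per bin.

    probs: (N, C) predicted class probabilities; labels: (N,) integer labels.
    """
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())
    return ece
```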

Figure A.9. Accuracy, NLL and ECE with increasing intensity of the jpeg compression corruption (from left to right).

Figure A.10. Accuracy, NLL and ECE with increasing intensity of the fog corruption (from left to right).

Figure A.11. Accuracy, NLL and ECE with increasing intensity of the snow corruption (from left to right).

Figure A.12. Accuracy, NLL and ECE with increasing intensity of the brightness corruption (from left to right).

Figure A.13. Accuracy, NLL and ECE with increasing intensity of the pixelate corruption (from left to right).

Figure A.14. Accuracy, NLL and ECE with increasing intensity of the zoom blur corruption (from left to right).

Figure A.15. Accuracy, NLL and ECE with increasing intensity of the saturate corruption (from left to right).

Figure A.16. Accuracy, NLL and ECE with increasing intensity of the contrast corruption (from left to right).

Figure A.17. Accuracy, NLL and ECE with increasing intensity of the motion blur corruption (from left to right).

Figure A.18. Accuracy, NLL and ECE with increasing intensity of the defocus blur corruption (from left to right).

Figure A.19. Accuracy, NLL and ECE with increasing intensity of the speckle noise corruption (from left to right).

Figure A.20. Accuracy, NLL and ECE with increasing intensity of the Gaussian blur corruption (from left to right).

Figure A.21. Accuracy, NLL and ECE with increasing intensity of the glass blur corruption (from left to right).

Figure A.22. Accuracy, NLL and ECE with increasing intensity of the shot noise corruption (from left to right).

Figure A.23. Accuracy, NLL and ECE with increasing intensity of the frost corruption (from left to right).

Figure A.24. Accuracy, NLL and ECE with increasing intensity of the spatter corruption (from left to right).

Figure A.25. Accuracy, NLL and ECE with increasing intensity of the impulse noise corruption (from left to right).

Figure A.26. Accuracy, NLL and ECE with increasing intensity of the elastic transform corruption (from left to right).