Discretization Drift in Two-Player Games

Mihaela Rosca 1 2 Yan Wu 1 Benoit Dherin 3 David G.T. Barrett 1

1 DeepMind, London, UK. 2 Center for Artificial Intelligence, University College London. 3 Google, Dublin, Ireland. Correspondence to: Mihaela Rosca.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Gradient-based methods for two-player games produce rich dynamics that can solve challenging problems, yet can be difficult to stabilize and understand. Part of this complexity originates from the discrete update steps given by simultaneous or alternating gradient descent, which causes each player to drift away from the continuous gradient flow — a phenomenon we call discretization drift. Using backward error analysis, we derive modified continuous dynamical systems that closely follow the discrete dynamics. These modified dynamics provide an insight into the notorious challenges associated with zero-sum games, including Generative Adversarial Networks. In particular, we identify distinct components of the discretization drift that can alter performance and in some cases destabilize the game. Finally, quantifying discretization drift allows us to identify regularizers that explicitly cancel harmful forms of drift or strengthen beneficial forms of drift, and thus improve performance of GAN training.

1. Introduction

The fusion of deep learning with two-player games has produced a wealth of breakthroughs in recent years, from Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) through to model-based reinforcement learning (Sutton & Barto, 2018; Rajeswaran et al., 2020). Gradient descent methods are widely used across these settings, partly because these algorithms scale well to high-dimensional models and large datasets. However, much of the recent progress in our theoretical understanding of two-player differentiable games builds upon the analysis of continuous differential equations that model the dynamics of training (Singh et al., 2000; Heusel et al., 2017; Nagarajan & Kolter, 2017), leading to a gap between theory and practice. Our aim is to take a step forward in our understanding of two-player games by finding continuous systems which better match the gradient descent updates used in practice.

Our work builds upon Barrett & Dherin (2021), who use backward error analysis to quantify the discretization drift induced by using gradient descent in supervised learning. We extend their work and use backward error analysis to understand the impact of discretization in the training dynamics of two-player games. More specifically, we quantify the Discretization Drift (DD), the difference between the solutions of the original ODEs defining the game and the discrete steps of the numerical scheme used to approximate them. To do so, we construct modified continuous systems that closely follow the discrete updates. While in supervised learning DD has a beneficial regularization effect (Barrett & Dherin, 2021), we find that the interaction between players in DD can have a destabilizing effect in adversarial games.

Contributions: Our primary contribution, Theorems 3.1 and 3.2, provides the continuous modified systems which quantify the discretization drift in simultaneous and alternating gradient descent for general two-player differentiable games. Both theorems are novel in their scope and generality, as well as their application toward understanding the effect of the discretization drift in two-player games parametrized with neural networks. Theorems 3.1 and 3.2 allow us to use dynamical system analysis to describe gradient descent without ignoring discretization drift, which we then use to:

• Provide new stability analysis tools (Section 4).

• Motivate explicit regularization methods which drastically improve the performance of simultaneous gradient descent in GAN training (Section 7.3).

• Pinpoint optimal regularization coefficients and shed new light on existing explicit regularizers (Table 1).

• Pinpoint the best performing learning rate ratios for alternating updates (Sections 5 and 7.2).

• Explain previously observed but unexplained phenomena, such as the difference in performance and stability between alternating and simultaneous updates in GAN training (Section 6).

2. Background

Backward Error Analysis: Backward error analysis was devised to study the long-term error incurred by following an ODE numerical solver instead of an exact ODE solution (Hairer & Lubich, 1997; Hairer et al., 2006). The general idea is to find a modified version of the original ODE that follows the steps of the numerical solver exactly. Recently, Barrett & Dherin (2021) used this technique to uncover a form of DD, called implicit gradient regularization, arising in supervised learning for models trained with gradient descent. They showed that for a model with parameters θ and loss L(θ) optimized with gradient descent θt = θt−1 − h∇θL(θ), the first order modified equation is θ̇ = −∇θL̃(θ), with modified loss

    L̃(θ) = L(θ) + (h/4) ‖∇θL(θ)‖²    (1)

This shows that there is a hidden implicit regularization effect, dependent on the learning rate h, that biases learning toward flat regions, where test errors are typically smaller (Barrett & Dherin, 2021).

Figure 1. Visualization of our approach, given by backward error analysis. For each player (we show only θ for simplicity), we find the modified ODE which captures the change in parameters introduced by the discrete updates with an error of O(h³). The modified ODE follows the discrete update more closely than the original ODE θ̇, which has an error of O(h²).

Two-player games: A well developed strategy for understanding two-player games in gradient-based learning is to analyze the continuous dynamics of the game (Singh et al., 2000; Heusel et al., 2017). Tools from dynamical systems theory have been used to explain convergence behaviors and to improve training using modifications of learning objectives or learning rules (Nagarajan & Kolter, 2017; Balduzzi et al., 2018; Mazumdar et al., 2019). Many of the insights from continuous two-player games apply to games that are trained with discrete updates. However, discrepancies are often observed between the continuous analysis and discrete training (Mescheder et al., 2018). In this work, we extend backward error analysis from the supervised learning setting – a special case of a one-player game – to the more general two-player game setting, thereby bridging this gap between the continuous systems that are often analyzed in theory and the discrete numerical methods used in practice.

3. Discretization drift

Throughout the paper we will denote by φ ∈ Rᵐ and θ ∈ Rⁿ the row vectors representing the parameters of the first and second player, respectively. The players' update functions will be denoted correspondingly by f(φ, θ): Rᵐ × Rⁿ → Rᵐ and by g(φ, θ): Rᵐ × Rⁿ → Rⁿ. In this setting, the partial derivative ∇θf(φ, θ) is the n × m matrix (∂θᵢ fⱼ)ᵢⱼ with i = 1, . . . , n and j = 1, . . . , m.

We aim to understand the impact of discretizing the ODEs

    φ̇ = f(φ, θ),    (2)
    θ̇ = g(φ, θ),    (3)

using either simultaneous or alternating Euler updates. We derive a modified continuous system of the form:

    φ̇ = f(φ, θ) + h f₁(φ, θ)    (4)
    θ̇ = g(φ, θ) + h g₁(φ, θ)    (5)

that closely follows the discrete Euler update steps; the local error between a discrete update and the modified system is of order O(h³) instead of O(h²) as is the case for the original continuous system given by Equations (2) and (3). If we can neglect errors of order O(h³), the terms f₁ and g₁ above characterize the DD of the discrete scheme, which can be used to help us understand the impact of DD. For example, it allows us to compare the use of simultaneous and alternating Euler updates, by comparing the dynamics of their associated modified systems as characterized by the DD terms f₁ and g₁.

We can specialize these modified equations using f = −∇φL₁ and g = −∇θL₂, where L₁ and L₂ are the loss functions for the two players. We will use this setting later to investigate the modified dynamics of simultaneous or alternating gradient descent. We can further specialize the form of these updates for common-payoff games (L₁ = L₂ = E) and zero-sum games (L₁ = −E, L₂ = E).

3.1. DD for simultaneous Euler updates

The simultaneous Euler updates with learning rates αh and λh respectively are given by

    φt = φt−1 + αh f(φt−1, θt−1)    (6)
    θt = θt−1 + λh g(φt−1, θt−1)    (7)

Theorem 3.1 The discrete simultaneous Euler updates in (6) and (7) follow the continuous system

    φ̇ = f − (αh/2)(f∇φf + g∇θf)
    θ̇ = g − (λh/2)(f∇φg + g∇θg)

with an error of size O(h³) after one update step.

Remark: A special case of Theorem 3.1 for zero-sum games with equal learning rates can be found in Lu (2021).

3.2. DD for alternating Euler updates

For alternating Euler updates, the players take turns to update their parameters, and can perform multiple updates each. We denote the number of alternating updates of the first player (resp. second player) by m (resp. k). We scale the learning rates by the number of updates, leading to the following updates φt := φm,t and θt := θk,t where

    φi,t = φi−1,t + (αh/m) f(φi−1,t, θt−1),  i = 1 . . . m,    (8)
    θj,t = θj−1,t + (λh/k) g(φm,t, θj−1,t),  j = 1 . . . k.    (9)

Theorem 3.2 The discrete alternating Euler updates in (8) and (9) follow the continuous system

    φ̇ = f − (αh/2)((1/m) f∇φf + g∇θf)
    θ̇ = g − (λh/2)((1 − 2α/λ) f∇φg + (1/k) g∇θg)

with an error of size O(h³) after one update step.

Remark: The modified equations in Theorems 3.1 and 3.2 closely follow the discrete updates only for learning rates where errors of size O(h³) can be neglected. Beyond this, higher order corrections are likely to contribute to the DD.
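To make the two discrete schemes concrete, the sketch below implements one step of the simultaneous updates (6)-(7) and of the alternating updates (8)-(9) for generic update functions f and g. It is a minimal illustration with scalar players (the toy f and g are an arbitrary choice for the demo), not the code used in our experiments; any vector-valued parameters work the same way.

```python
def simultaneous_euler(phi, theta, f, g, alpha_h, lambda_h):
    """One step of the simultaneous Euler updates, Equations (6)-(7):
    both players are updated from the same previous iterates."""
    phi_new = phi + alpha_h * f(phi, theta)
    theta_new = theta + lambda_h * g(phi, theta)
    return phi_new, theta_new


def alternating_euler(phi, theta, f, g, alpha_h, lambda_h, m=1, k=1):
    """One step of the alternating Euler updates, Equations (8)-(9):
    the first player takes m sub-steps, then the second player takes
    k sub-steps using the already-updated first-player parameters."""
    for _ in range(m):
        phi = phi + (alpha_h / m) * f(phi, theta)
    for _ in range(k):
        theta = theta + (lambda_h / k) * g(phi, theta)
    return phi, theta


# toy usage with scalar update functions
f = lambda phi, theta: -0.09 * phi + theta
g = lambda phi, theta: 0.09 * theta - phi
print(simultaneous_euler(1.0, 1.0, f, g, 0.2, 0.2))
print(alternating_euler(1.0, 1.0, f, g, 0.2, 0.2))
```

The only difference between the two schemes is whether the second player sees the old or the new first-player parameters; Theorems 3.1 and 3.2 quantify how this difference changes the drift.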

Remark: Equilibria of the original continuous systems (i.e., points where f = 0 and g = 0) remain equilibria of the modified continuous systems.

Definition 3.1 The discretization drift for each player has two terms: one term containing a player's own update function only – terms we will call self terms – and a term that also contains the other player's update function – which we will call interaction terms.

3.3. Sketch of the proofs

Following backward error analysis, the idea is to modify the original continuous system by adding corrections in powers of the learning rate: f̃ = f + h f₁ + h² f₂ + · · · and g̃ = g + h g₁ + h² g₂ + · · · , where for simplicity in this proof sketch we use the same learning rate for both players (detailed derivations can be found in the Supplementary Material). We want to find corrections fᵢ, gᵢ such that the modified system φ̇ = f̃ and θ̇ = g̃ follows the discrete update steps exactly. To do that we proceed in three steps:

Step 1: We expand the numerical scheme to find a relationship between φt and φt−1 and between θt and θt−1 up to order O(h²). In the case of the simultaneous Euler updates this does not require any change to Equations (6) and (7), while for alternating updates we have to expand the intermediate steps of the integrator using Taylor series.

Step 2: We compute the Taylor series of the modified equations' solution, yielding:

    φ̃(h) = φt−1 + h f + h²(f₁ + ½(f∇φf + g∇θf)) + O(h³)
    θ̃(h) = θt−1 + h g + h²(g₁ + ½(f∇φg + g∇θg)) + O(h³),

where all the f's, g's, and their derivatives are evaluated at (φt−1, θt−1).

Step 3: We match the terms of equal power in h so that the solution of the modified equations coincides with the discrete update after one step. This amounts to finding the corrections fᵢ and gᵢ so that all the terms of order higher than O(h) in the modified equation solution above vanish; this yields the first order drift terms f₁ and g₁ in terms of f, g, and their derivatives. For simultaneous updates we obtain f₁ = −½(f∇φf + g∇θf) and g₁ = −½(f∇φg + g∇θg), and by construction, we have obtained the modified truncated system which follows the discrete updates exactly up to order O(h³), leading to Theorem 3.1.

3.4. Visualizing trajectories

To illustrate the effect of DD in two-player games, we use a simple example adapted from Balduzzi et al. (2018):

    φ̇ = f(φ, θ) = −ε₁φ + θ;    θ̇ = g(φ, θ) = ε₂θ − φ    (10)

In Figure 2 we validate our theory empirically by visualizing the trajectories of the discrete Euler steps for simultaneous and alternating updates, and show that they closely match the trajectories of the corresponding modified continuous systems that we have derived. To visualize the trajectories of the original unmodified continuous system, we use Runge-Kutta 4 (a fourth-order numerical integrator that has no DD up to O(h⁵) in the case where the same learning rates are used for the two players — see Hairer & Lubich (1997) and the Supplementary Material for proofs). We will use Runge-Kutta 4 as a baseline throughout the paper.

Figure 2. The modified continuous flow captures the effect of DD in simultaneous and alternating Euler updates.
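The comparison in Figure 2 can be reproduced numerically. The sketch below takes one simultaneous Euler step on the example of Equation (10) and measures its distance to the original flow and to the first-order modified flow of Theorem 3.1, both integrated accurately with Runge-Kutta sub-steps; ε₁ = ε₂ = 0.09 and h = 0.2 follow Section 4, while the starting point and number of sub-steps are illustrative choices.

```python
import numpy as np

eps1, eps2, h = 0.09, 0.09, 0.2          # epsilon_1 = epsilon_2 = 0.09, learning rate 0.2


def original(x):                          # vector field of Equation (10); x = (phi, theta)
    phi, theta = x
    return np.array([-eps1 * phi + theta, eps2 * theta - phi])


def modified(x):                          # first-order modified field, Theorem 3.1 with alpha = lambda = 1
    f, g = original(x)
    f1 = -0.5 * (f * (-eps1) + g * 1.0)   # -(1/2)(f d_phi f + g d_theta f)
    g1 = -0.5 * (f * (-1.0) + g * eps2)   # -(1/2)(f d_phi g + g d_theta g)
    return original(x) + h * np.array([f1, g1])


def rk4_flow(vf, x, t, n=1000):           # accurate reference integration of a flow for time t
    dt = t / n
    for _ in range(n):
        k1 = vf(x)
        k2 = vf(x + 0.5 * dt * k1)
        k3 = vf(x + 0.5 * dt * k2)
        k4 = vf(x + dt * k3)
        x = x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x


x0 = np.array([0.05, 0.05])
x_euler = x0 + h * original(x0)           # one simultaneous Euler step, Equations (6)-(7)
print("distance to original flow:", np.linalg.norm(x_euler - rk4_flow(original, x0, h)))   # O(h^2)
print("distance to modified flow:", np.linalg.norm(x_euler - rk4_flow(modified, x0, h)))   # O(h^3)
```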

4. The stability of DD

The long-term behavior of gradient-based training can be characterized by the stability of its equilibria. Using stability analysis of a continuous system to understand discrete dynamics in two-player games has a fruitful history; however, prior work ignored discretization drift, since it analysed the stability of the original game ODEs. Using the modified ODEs given by backward error analysis provides two benefits here: 1) they account for the discretization drift, and 2) they provide different ODEs for simultaneous and for alternating updates, capturing the specificities of the two optimizers.

The modified ODE approach gives us a method to analyze the stability of the discrete updates: 1) choose the system and the update type – simultaneous or alternating – to be analyzed; 2) write the modified ODEs for the chosen system as given by Theorems 3.1 and 3.2; 3) write the corresponding modified Jacobian, and evaluate it at an equilibrium; 4) determine the stability of the equilibrium by computing the eigenvalues of the modified Jacobian.

Steps 1 and 2 are easy, and step 4 is required in the stability analysis of any continuous system. For step 3, we provide a general form of the modified Jacobian at the equilibrium point of the original system, where f = 0 and g = 0:

    J̃ = [∇φf̃   ∇θf̃ ; ∇φg̃   ∇θg̃] = J − (h/2) K    (11)

where J is the unmodified Jacobian and K is a matrix that depends on the update type. For simultaneous updates (see the Supplementary Material for alternating updates)

    K = [α((∇φf)² + ∇φg∇θf)      α(∇θf∇φf + ∇θg∇θf) ;
         λ(∇φg∇θg + ∇φf∇φg)      λ((∇θg)² + ∇θf∇φg)].

Using the method above we show (in the Supplementary Material) that, in two-player games, the drift can change a stable equilibrium into an unstable one, which is not the case for supervised learning (Barrett & Dherin, 2021). For simultaneous updates with equal learning rates, this recovers a result of Daskalakis & Panageas (2018) derived in the context of zero-sum games. We show this by example: consider the game given by the system of ODEs in Equation (10) with ε₁ = ε₂ = 0.09. The stability analysis of its modified ODEs for the simultaneous Euler updates shows they diverge when αh = λh = 0.2. We use the same example to illustrate the difference in behavior between simultaneous and alternating updates: the stability analysis shows the modified ODEs for alternating updates converge to a stable equilibrium. In both cases, the results obtained using the stability analysis of the modified ODEs are consistent with empirical outcomes obtained by following the corresponding discrete updates, as shown in Figure 3; this would not have been the case had we used the original system to do stability analysis, which would have always predicted convergence to an equilibrium.

Figure 3. Discretization drift can change the stability of a game. ε₁ = ε₂ = 0.09 and a learning rate of 0.2.

The modified ODEs help us bridge the gap between theory and practice: they allow us to extend the reach of stability analysis to a wider range of techniques used for training, such as alternating gradient descent. We hope the method we provide will be used in the context of GANs, to expand prior work such as that of Nagarajan & Kolter (2017) to alternating updates. However, the modified ODEs are not without limitations: they ignore discretization errors smaller than O(h³), and thus they are not equivalent to the discrete updates; methods that directly assess the convergence of discrete updates (e.g. Mescheder et al. (2017)) remain an indispensable tool for understanding discrete systems.
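The stability recipe above can be carried out numerically. The sketch below evaluates the modified Jacobian of Equation (11) for the example game of Equation (10) at its equilibrium, using the K matrix for simultaneous updates with α = λ = 1 and h = 0.2 (the alternating-updates K is in the Supplementary Material and is not reproduced here); the spectral radius of the discrete update matrix I + hJ is printed for comparison with the actual Euler iteration.

```python
import numpy as np

eps1, eps2, h = 0.09, 0.09, 0.2
alpha = lam = 1.0

# Jacobian blocks of the vector field in Equation (10), evaluated at the equilibrium (0, 0).
dphi_f, dtheta_f = -eps1, 1.0
dphi_g, dtheta_g = -1.0, eps2

J = np.array([[dphi_f, dtheta_f],
              [dphi_g, dtheta_g]])

# K matrix for simultaneous updates (Section 4).
K = np.array([[alpha * (dphi_f**2 + dphi_g * dtheta_f), alpha * (dtheta_f * dphi_f + dtheta_g * dtheta_f)],
              [lam * (dphi_g * dtheta_g + dphi_f * dphi_g), lam * (dtheta_g**2 + dtheta_f * dphi_g)]])

J_mod = J - 0.5 * h * K   # modified Jacobian, Equation (11)

print("original eigenvalues:", np.linalg.eigvals(J))        # ignores discretization drift
print("modified eigenvalues:", np.linalg.eigvals(J_mod))    # positive real parts: the drift destabilizes
print("discrete spectral radius:",
      max(abs(np.linalg.eigvals(np.eye(2) + h * J))))       # > 1: the simultaneous Euler steps diverge
```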
5. Common-payoff games

When the players share a common loss, as in common-payoff games, we recover supervised learning with a single loss E, but with the extra freedom of training the weights corresponding to different parts of the model with possibly different learning rates and update strategies (see for instance You et al. (2018), where a per-layer learning rate is used to obtain extreme training speedups at equal levels of test accuracy). A special case occurs when the two players with equal learning rates (α = λ) perform simultaneous gradient descent. In this case, both modified losses exactly recover Equation (1). Barrett & Dherin (2021) argue that minimizing the loss-gradient norm, in this case, has a beneficial effect.

In this section, we instead focus on alternating gradient descent. We partition a neural network into two parts, corresponding to two sets of parameters, φ for the parameters closer to the input and θ for the parameters closer to the output. This procedure freezes one part of the network while training the other part and alternates between the two parts. This scenario may arise in a distributed training setting, as a form of block coordinate descent. In the case of common-payoff games, we have the following as a special case of Theorem 3.2 by substituting f = −∇φE and g = −∇θE:

Corollary 5.1 In a two-player common-payoff game with common loss E, alternating gradient descent – as described in equations (8) and (9) – with one update per player follows a gradient flow given by the modified losses

    L̃₁ = E + (αh/4)(‖∇φE‖² + ‖∇θE‖²)    (12)
    L̃₂ = E + (λh/4)((1 − 2α/λ)‖∇φE‖² + ‖∇θE‖²)    (13)

with an error of size O(h³) after one update step.

The term (1 − 2α/λ)‖∇φE‖² in Eq. (13) is negative when the learning rates are equal, seeking to maximize the gradient norm of player φ. According to Barrett & Dherin (2021), we expect less stable training and worse performance in this case. This prediction is verified in Figure 4, where we compare simultaneous and alternating gradient descent for an MLP trained on MNIST with a common learning rate.

Figure 4. In common-payoff games alternating updates lead to higher gradient norms and unstable training.

6. Analysis of zero-sum games

We now study zero-sum games, where the two adversarial players have opposite losses −L₁ = L₂ = E. We substitute the updates f = ∇φE and g = −∇θE in the theorems in Section 3 and obtain:

Corollary 6.1 In a zero-sum two-player differentiable game, simultaneous gradient descent – as described in equations (6) and (7) – follows a gradient flow given by the modified losses

    L̃₁ = −E + (αh/4)‖∇φE‖² − (αh/4)‖∇θE‖²,    (14)
    L̃₂ = E − (λh/4)‖∇φE‖² + (λh/4)‖∇θE‖²,    (15)

with an error of size O(h³) after one update step.

Remark: Discretization drift preserves the adversarial structure of the game, with the interaction terms maximizing the gradient norm of the opposite player, while the self terms are minimizing the player's own gradient norm.

Remark: The modified losses of zero-sum games trained with simultaneous gradient descent with different learning rates are no longer zero-sum.

Corollary 6.2 In a zero-sum two-player differentiable game, alternating gradient descent – as described in equations (8) and (9) – follows a gradient flow given by the modified losses

    L̃₁ = −E + (αh/(4m))‖∇φE‖² − (αh/4)‖∇θE‖²    (16)
    L̃₂ = E − (λh/4)(1 − 2α/λ)‖∇φE‖² + (λh/(4k))‖∇θE‖²    (17)

with an error of size O(h³) after one update step.

Remark: The modified losses of zero-sum games trained with alternating gradient descent are not zero-sum.

Remark: Since (1 − 2α/λ) < 1 in the alternating case, there is always less weight on the term encouraging maximizing ‖∇φE‖² compared to the simultaneous case under the same learning rates. For α/λ > 1/2 both players minimize ‖∇φE‖².

6.1. Dirac-GAN: an illustrative example

Mescheder et al. (2018) introduce the Dirac-GAN as an example to illustrate the often complex training dynamics in zero-sum games. We follow this example to provide an intuitive understanding of DD. The generative adversarial network (Goodfellow et al., 2014) is an example of a two-player game that has been successfully used for distribution learning. The generator G with parameters θ learns a mapping from samples of the latent distribution z ∼ p(z) to the data space, while the discriminator D with parameters φ learns to distinguish these samples from data. Dirac-GAN aims to learn a Dirac delta distribution with mass at zero; the generator is modeling a Dirac with parameter θ: G(z) = θ, and the discriminator is a linear model on the input with parameter φ: Dφ(x) = φx. This results in the zero-sum game given by:

    E = l(θφ) + l(0)    (18)

where l depends on the GAN formulation used (l(z) = −log(1 + e⁻ᶻ) for instance). The unique equilibrium point is θ = φ = 0. All proofs for this section are in the Supplementary Material, together with visualizations.
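As a concrete illustration, the following JAX sketch builds the modified losses of Corollary 6.1 for a zero-sum game specified by a scalar function E(φ, θ), and instantiates them on the Dirac-GAN loss of Equation (18). It is schematic (scalar parameters, hypothetical coefficient values), not the code used in our experiments; it also checks the remark above that the modified losses stay zero-sum only under equal learning rates.

```python
import jax
import jax.numpy as jnp


def modified_losses_simultaneous(E, alpha_h, lambda_h):
    """Modified losses of Corollary 6.1 for a zero-sum game with loss E(phi, theta)."""
    sq = lambda v: jnp.sum(v ** 2)
    dE_dphi, dE_dtheta = jax.grad(E, argnums=0), jax.grad(E, argnums=1)

    def L1_tilde(phi, theta):   # Equation (14): -E plus the first player's drift terms
        return (-E(phi, theta) + alpha_h / 4 * sq(dE_dphi(phi, theta))
                - alpha_h / 4 * sq(dE_dtheta(phi, theta)))

    def L2_tilde(phi, theta):   # Equation (15): E plus the second player's drift terms
        return (E(phi, theta) - lambda_h / 4 * sq(dE_dphi(phi, theta))
                + lambda_h / 4 * sq(dE_dtheta(phi, theta)))

    return L1_tilde, L2_tilde


# Dirac-GAN loss of Equation (18), with l(z) = -log(1 + exp(-z)), so l(0) = -log(2).
E = lambda phi, theta: -jnp.log(1.0 + jnp.exp(-theta * phi)) - jnp.log(2.0)

L1, L2 = modified_losses_simultaneous(E, alpha_h=0.01, lambda_h=0.01)
print("equal learning rates,     L1 + L2 =", L1(0.3, 0.5) + L2(0.3, 0.5))   # remains zero-sum
L1, L2 = modified_losses_simultaneous(E, alpha_h=0.01, lambda_h=0.02)
print("different learning rates, L1 + L2 =", L1(0.3, 0.5) + L2(0.3, 0.5))   # no longer zero-sum
```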
Reconciling discrete and continuous updates in Dirac-GAN: The continuous dynamics induced by the gradient field from Equation (18) preserve θ² + φ², while for simultaneous gradient descent θ² + φ² increases with each update (Mescheder et al., 2018); different conclusions are reached when analyzing the dynamics of the original continuous system versus the discrete updates. We show that the modified ODEs given by Equations (14) and (15) resolve this discrepancy, since simultaneous gradient descent modifies the continuous dynamics to increase θ² + φ², leading to consistent conclusions from the modified continuous and discrete dynamics.

Explicit regularization stabilizes Dirac-GAN: To counteract the instability in the Dirac-GAN induced by the interaction terms we can add explicit regularization with the same functional form: L₁ = −E + u‖∇θE‖² and L₂ = E + ν‖∇φE‖², where u, ν are of O(h). We find that the modified Jacobian for this modified system with explicit regularization is negative definite if u > hα/4 and ν > hλ/4, so the system is asymptotically stable and converges to the optimum. Notably, by quantifying DD, we are able to find the regularization coefficients which guarantee convergence and show that they depend on learning rates.
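The following sketch simulates both observations for scalar φ and θ: simultaneous gradient descent on the unregularized zero-sum losses increases θ² + φ², while adding the gradient-norm regularizers above with u = ν > h/4 (here α = λ = 1) drives the iterates toward the equilibrium. The learning rate, step count and initialization are illustrative choices.

```python
import jax
import jax.numpy as jnp

h = 0.1                                          # shared learning rate (alpha = lambda = 1)
E = lambda phi, theta: -jnp.log(1.0 + jnp.exp(-theta * phi)) - jnp.log(2.0)   # Equation (18)


def train(L1, L2, phi, theta, steps=1000):
    """Simultaneous gradient descent on the two players' losses."""
    dL1_dphi, dL2_dtheta = jax.grad(L1, argnums=0), jax.grad(L2, argnums=1)
    for _ in range(steps):
        phi, theta = phi - h * dL1_dphi(phi, theta), theta - h * dL2_dtheta(phi, theta)
    return phi, theta


# Unregularized zero-sum game: theta^2 + phi^2 grows, as predicted by the modified losses (14)-(15).
phi, theta = train(lambda p, t: -E(p, t), E, 0.1, 0.1)
print("no regularization  :", phi**2 + theta**2)

# Explicit regularization with u = nu > h/4, the stability condition derived above.
u = nu = 0.05
L1_reg = lambda p, t: -E(p, t) + u * jax.grad(E, argnums=1)(p, t) ** 2
L2_reg = lambda p, t: E(p, t) + nu * jax.grad(E, argnums=0)(p, t) ** 2
phi, theta = train(L1_reg, L2_reg, 0.1, 0.1)
print("with regularization:", phi**2 + theta**2)   # shrinks toward the equilibrium (0, 0)
```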
7. Experimental analysis of GANs

To understand the effect of DD on more complex adversarial games, we analyze GANs trained for image generation on the CIFAR10 dataset. We follow the model architectures from Spectral-Normalized GANs (Miyato et al., 2018). Both players have millions of parameters. We employ the original GAN formulation, where the discriminator is a binary classifier trained to classify data from the dataset as real and model samples as fake; the generator tries to generate samples that the discriminator considers as real. This can be formulated as a zero-sum game:

    E = 𝔼_{p*(x)} log Dφ(x) + 𝔼_{p(z)} log(1 − Dφ(Gθ(z)))

When it comes to gradient descent, GAN practitioners often use alternating, not simultaneous updates: the discriminator is updated first, followed by the generator. However, recent work shows higher-order numerical integrators can work well with simultaneous updates (Qin et al., 2020). We will show that DD can be seen as the culprit behind some of the challenges in simultaneous gradient descent in zero-sum GANs, indicating ways to improve training performance in this setting. For a clear presentation of the effects of DD, we employ a minimalist training setup. Instead of using popular adaptive optimizers such as Adam (Kingma & Ba, 2015), we train all the models with vanilla stochastic gradient descent, without momentum or variance reduction methods.

We use the Inception Score (IS) (Salimans et al., 2016) for evaluation. Our training curves contain a horizontal line at the Inception Score of 7.5, obtained with the same architectures we use, but with the Adam optimizer (the score reported by Miyato et al. (2018) is 7.42). All learning curves correspond to the best 10% of models for the corresponding training setting from a sweep over learning rates and 5 seeds; for a discussion of variance across seeds and corresponding plots, as well as the best 20% and 30% of models, see the Supplementary Material. We also report box plots showing the performance quantiles across all hyperparameters and seeds, together with the top 10% of models. For SGD we use learning rates {5 × 10⁻², 1 × 10⁻², 5 × 10⁻³, 1 × 10⁻³} for each player; for Adam, we use learning rates {1 × 10⁻⁴, 2 × 10⁻⁴, 3 × 10⁻⁴, 4 × 10⁻⁴}, which have been widely used in the literature (Miyato et al., 2018). When comparing to Runge-Kutta (RK4) to assess the effect of DD we always use the same learning rates for both players. We present results using additional losses, via LS-GAN (Mao et al., 2017), and report FID results (Heusel et al., 2017) and full experimental details in the Supplementary Material. The code associated with this work can be found at https://github.com/deepmind/deepmind-research/dd_two_player_games.

7.1. Does DD affect training?

We start our experimental analysis by showing the effect of DD on zero-sum games. We compare simultaneous gradient descent, alternating gradient descent and Runge-Kutta 4 updates, since they follow different continuous dynamics given by the modified equations we have derived up to O(h³) error. Figure 5 shows simultaneous gradient descent performs substantially worse than alternating updates. When compared to Runge-Kutta 4, which has a DD error of O(h⁵) when the two players have equal learning rates, we see that Runge-Kutta performs better, and that removing the drift improves training. Multiple updates also affect training, either positively or negatively, depending on learning rates; see the Supplementary Material.

Figure 5. The effect of DD on zero-sum games. Alternating updates perform better (left), and with equal learning rates (right), RK4 has O(h⁵) drift and performs better.

7.2. The importance of learning rates in DD

While in simultaneous updates (Eqs (14) and (15)) the interaction terms of both players maximize the gradient norm of the other player, alternating gradient descent (Eqs (16) and (17)) exhibits less pressure on the second player (generator) to maximize the norm of the first player (discriminator). In alternating updates, when the ratio between the discriminator and generator learning rates exceeds 0.5, both players are encouraged to minimize the discriminator's gradient norm.

Figure 6. Alternating gradient descent performs better for learning rate ratios which reduce the adversarial nature of DD. The same learning rate ratios show no advantage in the simultaneous case.

Figure 7. Simultaneous updates: explicit regularization canceling the interaction terms of DD improves performance, both for the best performing models (left) and across a sweep (right).

To understand the effect of learning rate ratios in training, we performed a sweep where the discriminator learning rates α are sampled uniformly in [0.001, 0.01], and the learning rate ratios α/λ are in {0.1, 0.2, 0.5, 1, 2, 5}, with results shown in Figure 6. Learning rate ratios greater than 0.5 perform best for alternating updates, and enjoy a substantial increase in performance compared to simultaneous updates.

7.3. Improving performance by explicit regularization

We investigate whether canceling the interaction terms between the two players can improve training stability and performance in zero-sum games trained using simultaneous gradient descent. We train models using the losses:

    L₁ = −E + c₁‖∇θE‖²    (19)
    L₂ = E + c₂‖∇φE‖²    (20)

If c₁, c₂ are O(h) we can ignore the DD from these regularization terms, since their effect on DD will be of order O(h³). We can set the coefficients to be the negative of the coefficients present in DD, namely c₁ = αh/4 and c₂ = λh/4; we thus cancel the interaction terms which maximize the gradient norm of the other player, while keeping the self terms, which minimize the player's own gradient norm. We show results in Figure 7: canceling the interaction terms leads to substantial improvement compared to SGD, obtains the same peak performance as Adam (though it requires more training iterations) and recovers the performance of Runge-Kutta 4 (Figure 5). Unlike Adam, we do not observe a decrease in performance when trained for longer, but report higher variance in performance across seeds; see the Supplementary Material.

Connection with Symplectic Gradient Adjustment (SGA): Balduzzi et al. (2018) proposed SGA to improve the dynamics of gradient-based methods for games, by counteracting the rotational force of the vector field. Adjusting the gradient field can be viewed as modifying the losses as in Equation (20); the modification from SGA cancels the interaction terms we identified. However, it is unclear whether the fixed coefficients of c₁ = c₂ = 1/2 used by SGA are always optimal, while our analysis shows the strength of DD changes with the exact discretization scheme, such as the step size. Indeed, as our experimental results in Figure 8(a) show, adjusting the coefficient in SGA strongly affects training.

Figure 8. Comparison with Symplectic Gradient Adjustment (SGA, panel a) and Consensus Optimization (CO, panel b): DD motivates the form of explicit regularization and provides the optimal regularization coefficients, without requiring a larger hyperparameter sweep.

Connection with Consensus Optimization (CO): Mescheder et al. (2017) analyze the discrete dynamics of gradient descent in zero-sum games and prove that, under certain assumptions, adding explicit regularization that encourages the players to minimize the gradient norm of both players guarantees convergence to a local Nash equilibrium. Their approach includes canceling the interaction terms, but also requires strengthening the self terms, using losses:

    L₁ = −E + c₁‖∇θE‖² + s₁‖∇φE‖²    (21)
    L₂ = E + s₂‖∇θE‖² + c₂‖∇φE‖²,    (22)

where they use s₁ = s₂ = c₁ = c₂ = γ, where γ is a hyperparameter. In order to understand the effect of the self and interaction terms, we compare to CO, as well as a similar approach where we use coefficients proportional to the drift, namely s₁ = αh/4 and s₂ = λh/4; this effectively doubles the strength of the self terms in DD. We show results in Figure 8(b). We first notice that CO can improve results over vanilla SGD. However, similarly to what we observed with SGA, the regularization coefficient is important and thus requires a hyperparameter sweep, unlike our approach which uses the coefficients provided by the DD. We further notice that strengthening the norms using the DD coefficients can improve training, but performs worse compared to only canceling the interaction terms. This shows the importance of finding the right training regime, and that strengthening the self terms does not always improve performance.
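For completeness, here is a schematic JAX construction of the regularized losses in Equations (19) and (20) with the DD-derived coefficients c₁ = αh/4 and c₂ = λh/4, together with one simultaneous SGD step on them. It assumes the game is given as a single scalar function E of flat parameter vectors; in a real GAN, E is estimated from minibatches and the parameters are pytrees, so this illustrates the form of the regularizers rather than the training code used for our experiments.

```python
import jax
import jax.numpy as jnp


def dd_cancel_interaction_losses(E, alpha, lam, h):
    """Regularized losses of Equations (19)-(20) with c1 = alpha*h/4 and c2 = lam*h/4,
    canceling the DD interaction terms for simultaneous updates on a zero-sum game
    with loss E(phi, theta)."""
    sq_norm = lambda v: jnp.sum(v ** 2)

    def L1(phi, theta):   # discriminator: -E plus cancellation of its interaction term
        return -E(phi, theta) + (alpha * h / 4) * sq_norm(jax.grad(E, argnums=1)(phi, theta))

    def L2(phi, theta):   # generator: E plus cancellation of its interaction term
        return E(phi, theta) + (lam * h / 4) * sq_norm(jax.grad(E, argnums=0)(phi, theta))

    return L1, L2


def simultaneous_sgd_step(E, phi, theta, alpha, lam, h):
    """One simultaneous gradient descent step on the regularized losses; note that the
    gradient-norm penalties require second-order differentiation, which jax.grad handles."""
    L1, L2 = dd_cancel_interaction_losses(E, alpha, lam, h)
    new_phi = phi - alpha * h * jax.grad(L1, argnums=0)(phi, theta)
    new_theta = theta - lam * h * jax.grad(L2, argnums=1)(phi, theta)
    return new_phi, new_theta
```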

Figure 9. Alternating updates: DD tells us which explicit regularization terms to use; canceling only the discriminator interaction term improves sensitivity across hyperparameters (right).

Figure 10. The effect of DD depends on the game: its effect is less strong for the non-saturating loss.

Alternating updates: We perform the same exercise for alternating updates, where c₁ = αh/4 and c₂ = (λh/4)(1 − 2α/λ). We also study the performance obtained by only canceling the discriminator interaction term, since when α/λ > 0.5 the generator interaction term minimizes, rather than maximizes, the discriminator gradient norm and thus the generator interaction term might not have a strong destabilizing force. We observe that adding explicit regularization mainly brings the benefit of reduced variance when canceling the discriminator interaction term (Figure 9). As for simultaneous updates, we find that knowing the form of DD guides us to a choice of explicit regularization: for alternating updates canceling both interaction terms can hurt training, but the form of the modified losses suggests that we should only cancel the discriminator interaction term, with which we can obtain some gains.

Does canceling the interaction terms help for every choice of learning rates? The substantial performance and stability improvement we observe applies to the performance obtained across a learning rate sweep. For individual learning rate choices however, canceling the interaction terms is not guaranteed to improve learning.
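The alternating-update variant of the regularizers follows the same pattern as the simultaneous sketch above; the short sketch below only computes the coefficients, c₁ = αh/4 and c₂ = (λh/4)(1 − 2α/λ), and exposes the option of canceling only the discriminator interaction term discussed in the previous paragraph. It is likewise schematic.

```python
def dd_cancel_interaction_coeffs_alternating(alpha, lam, h, cancel_generator_term=False):
    """Coefficients for Equations (19)-(20) under alternating updates.

    c1 always cancels the discriminator interaction term of Corollary 6.2; c2 cancels the
    generator interaction term only when requested, and changes sign with the learning-rate
    ratio, mirroring the drift term it cancels (it is negative when alpha / lam > 0.5)."""
    c1 = alpha * h / 4
    c2 = (lam * h / 4) * (1.0 - 2.0 * alpha / lam) if cancel_generator_term else 0.0
    return c1, c2


# example: discriminator twice as fast as the generator, so only c1 is used by default
print(dd_cancel_interaction_coeffs_alternating(alpha=2.0, lam=1.0, h=0.01))
print(dd_cancel_interaction_coeffs_alternating(alpha=2.0, lam=1.0, h=0.01, cancel_generator_term=True))
```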
7.4. Extension to non-zero-sum GANs

Finally, we extend our analysis to GANs with the non-saturating loss for the generator E_G = −𝔼_{p(z)} log Dφ(Gθ(z)) introduced by Goodfellow et al. (2014), while keeping the discriminator loss unchanged as E_D = 𝔼_{p*(x)} log Dφ(x) + 𝔼_{p(z)} log(1 − Dφ(Gθ(z))). In contrast with the dynamics from zero-sum GANs we analyzed earlier, changing from simultaneous to alternating updates results in little change in performance, as can be seen in Figure 10. Despite having the same adversarial structure and the same discriminator loss, changing the generator loss changes the relative performance of the different discrete update schemes. Since the effect of DD strongly depends on the game, we recommend analyzing the performance of discrete numerical schemes on a case by case basis. Indeed, while for general two-player games we cannot always write modified losses as for the zero-sum case – see the Supplementary Material for a discussion – we can use Theorems 3.1 and 3.2 to understand the effect of the drift for specific choices of loss functions.

8. Related work

Backward error analysis: There have been a number of recent works applying backward error analysis to machine learning in the one-player case (supervised learning), which can potentially be extended to the two-player setting. In particular, Smith et al. (2021) extended the analysis of Barrett & Dherin (2021) to stochastic gradient descent. Kunin et al. (2021) used modified gradient flow equations to show that discrete updates break certain conservation laws present in the gradient flows of deep learning models. Franca et al. (2020) compare momentum and Nesterov acceleration using their modified equations. França et al. (2021) used backward error analysis to help devise optimizers with a control on their stability and convergence rates. Li et al. (2017) used modified equation techniques in the context of stochastic differential equations to devise optimizers with adaptive learning rates. Other recent works such as Unlu & Aitchison (2020) and Jastrzebski et al. (2020) noticed the strong impact of implicit gradient regularization in the training of over-parametrized models with SGD using different approaches. These works provide evidence that DD has a significant impact in deep learning, and that backward error analysis is a powerful tool to quantify it. At last, note that a special case of Theorem 3.1 for zero-sum games with equal learning rates can be found in Lu (2021).

Two-player games: As one of the best-known examples at the interface of game theory and deep learning, GANs have been powered by gradient-based optimizers as other deep neural networks. The idea of an implicit regularization effect induced by simultaneous gradient descent in GAN training was first discussed in Schäfer et al. (2020); the authors show empirically that the implicit regularization induced by gradient descent can have a positive effect on GAN training, and take a game theoretic approach to strengthening it using competitive gradient descent (Schäfer & Anandkumar, 2019).

By quantifying the DD induced by gradient descent using backward error analysis, we shed additional light on the regularization effect that discrete updates have on GAN training, and show it has both beneficial and detrimental components (as also shown in Daskalakis & Panageas (2018) with different methods in the case of simultaneous GD). Moreover, we show that the explicit regularization inspired by DD results in drastically improved performance. The form of the modified losses we have derived is related to explicit regularization for GANs, which has been one of the most efficient methods for stabilizing GANs as well as other games. Some of the regularizers constrain the complexity of the players (Miyato et al., 2018; Gulrajani et al., 2017; Brock et al., 2018), while others modify the dynamics for better convergence properties (Qin et al., 2020; Mescheder et al., 2018; Nagarajan & Kolter, 2017; Balduzzi et al., 2018; Wang et al., 2019; Mazumdar et al., 2019). Our approach is orthogonal in that, with backward error analysis, we start from understanding the most basic gradient steps, underpinning any further modifications of the losses or gradient fields. Importantly, we discovered a relationship between learning rates and the underlying regularization. Since merely canceling the effect of DD is insufficient in practice (as also observed by Qin et al. (2020)), our approach complements regularizers that are explicitly designed to improve convergence. We leave to future research the further study of other regularizers in combination with our analysis. We summarize some of these regularizers in Table 1.

To our knowledge, this is the first work towards finding continuous systems which better match the gradient descent updates used in two-player games. Studying discrete algorithms using continuous systems has a rich history in optimization (Su et al., 2016; Wibisono et al., 2016). Recently, models that directly parametrize differential equations demonstrated additional potential from bridging the discrete and continuous perspectives (Chen et al., 2018; Grathwohl et al., 2018).

Table 1. Comparing DD with explicit regularization methods in zero-sum games. SGA (without alignment, see the Supplementary Material and Balduzzi et al. 2018), Consensus Optimization (Mescheder et al., 2017), Locally Stable GAN (Nagarajan & Kolter, 2017), ODE-GAN (Qin et al., 2020). We assume a min_θ min_φ game with learning rates αh and λh and numbers of updates m and k, respectively. S and A denote simultaneous and alternating updates.

    Method                           | Explicit | First player (φ)                  | Second player (θ)
    DD (S)                           | ✗        | (αh/4)‖∇φE‖² − (αh/4)‖∇θE‖²       | −(λh/4)‖∇φE‖² + (λh/4)‖∇θE‖²
    DD (A)                           | ✗        | (αh/4m)‖∇φE‖² − (αh/4)‖∇θE‖²      | ((2α−λ)h/4)‖∇φE‖² + (λh/4k)‖∇θE‖²
    Cancel DD interaction terms (S)  | ✓        | (αh/4)‖∇θE‖²                      | (λh/4)‖∇φE‖²
    Cancel DD interaction terms (A)  | ✓        | (αh/4)‖∇θE‖²                      | −((2α−λ)h/4)‖∇φE‖²
    SGA (S)                          | ✓        | (1/2)‖∇θE‖²                       | (1/2)‖∇φE‖²
    Consensus optimization (S)       | ✓        | η‖∇φE‖² + η‖∇θE‖²                 | η‖∇φE‖² + η‖∇θE‖²
    Locally stable GAN (S)           | ✓        | —                                 | η‖∇φE‖²
    ODE-GAN (S)                      | ✓        | η‖∇θE‖²                           | —

9. Discussion

We have shown that using modified continuous systems to quantify the discretization drift induced by gradient descent updates can help bridge the gap between the discrete optimization dynamics used in practice and the analysis of the original systems through continuous methods. This allowed us to cast a new light on the stability and performance of games trained using gradient descent, and guided us towards explicit regularization strategies inspired by these modified systems. We note however that DD merely modifies a game's original dynamics, and that the DD terms alone cannot fully characterize a discrete scheme. In this sense, our approach only complements works analyzing the underlying game (Shoham & Leyton-Brown, 2008). Also, it is worth noting that our analysis is valid for learning rates small enough that errors of size O(h³) can be neglected. We have focused our empirical analysis on a few classes of two-player games, but the effect of DD will be relevant for all games trained using gradient descent. Our method can be expanded beyond gradient descent to other optimization methods such as Adam (Kingma & Ba, 2015), as well as to the stochastic setting as shown in Smith et al. (2021). We hope that our analysis of gradient descent provides a useful building block to further the understanding and performance of two-player games.
Acknowledgments. We would like to thank the reviewers and area chair for useful suggestions, and Alex Goldin, Chongli Qin, Jeff Donahue, Marc Deisenroth, Michael Munn, Patrick Cole, Sam Smith and Shakir Mohamed for useful discussions and support throughout this work.

References

Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. The mechanics of n-player differentiable games. In International Conference on Machine Learning, pp. 354–363. PMLR, 2018.

Barrett, D. G. and Dherin, B. Implicit gradient regularization. In International Conference on Learning Representations, 2021.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2018.

Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583, 2018.

Daskalakis, C. and Panageas, I. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, volume 31, 2018.

Franca, G., Sulam, J., Robinson, D., and Vidal, R. Conformal symplectic and relativistic optimization. In Conference on Neural Information Processing Systems (NeurIPS 2020), 2020.

França, G., Jordan, M. I., and Vidal, R. On dissipative symplectic integration with applications to gradient-based optimization. Journal of Statistical Mechanics: Theory and Experiment, 2021(4):043402, 2021.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.

Hairer, E. and Lubich, C. The life-span of backward error analysis for numerical integrators. Numerische Mathematik, 76:441–457, 1997.

Hairer, E., Lubich, C., and Wanner, G. Geometric Numerical Integration, volume 31. Springer-Verlag, Berlin, second edition, 2006. ISBN 3-540-30663-3; 978-3-540-30663-4.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.

Jastrzebski, S. K., Arpit, D., Astrand, O., Kerg, G., Wang, H., Xiong, C., Socher, R., Cho, K., and Geras, K. J. Catastrophic Fisher explosion: Early phase Fisher matrix impacts generalization. arXiv preprint arXiv:2012.14193, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Kunin, D., Sagastuy-Brena, J., Ganguli, S., Yamins, D. L. K., and Tanaka, H. Symmetry, conservation laws, and learning dynamics in neural networks. In International Conference on Learning Representations, 2021.

Li, Q., Tai, C., and E, W. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, volume 70, pp. 2101–2110, 2017.

Lu, H. An O(s^r)-resolution ODE framework for understanding discrete-time algorithms and applications to the linear convergence of minimax problems. arXiv preprint arXiv:2001.08826, 2021.

Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802, 2017.

Mazumdar, E. V., Jordan, M. I., and Sastry, S. S. On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838, 2019.

Mescheder, L., Nowozin, S., and Geiger, A. The numerics of GANs. In Advances in Neural Information Processing Systems, pp. 1825–1835, 2017.

Mescheder, L., Geiger, A., and Nowozin, S. Which training methods for GANs do actually converge? In International Conference on Machine Learning, pp. 3481–3490. PMLR, 2018.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.

Nagarajan, V. and Kolter, J. Z. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pp. 5585–5595, 2017.

Qin, C., Wu, Y., Springenberg, J. T., Brock, A., Donahue, J., Lillicrap, T. P., and Kohli, P. Training generative adversarial networks by solving ordinary differential equations. 2020.

Rajeswaran, A., Mordatch, I., and Kumar, V. A game theoretic framework for model based reinforcement learning. In International Conference on Machine Learning, pp. 7953–7963. PMLR, 2020.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.

Schäfer, F. and Anandkumar, A. Competitive gradient descent. In Advances in Neural Information Processing Systems, pp. 7625–7635, 2019.

Schäfer, F., Zheng, H., and Anandkumar, A. Implicit competitive regularization in GANs. In International Conference on Machine Learning, pp. 8533–8544. PMLR, 2020.

Shoham, Y. and Leyton-Brown, K. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2008.

Singh, S. P., Kearns, M. J., and Mansour, Y. Nash convergence of gradient dynamics in general-sum games. In UAI, pp. 541–548, 2000.

Smith, S. L., Dherin, B., Barrett, D. G., and De, S. On the origin of implicit regularization in stochastic gradient descent. In International Conference on Learning Representations, 2021.

Su, W., Boyd, S., and Candes, E. J. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. The Journal of Machine Learning Research, 17(1):5312–5354, 2016.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Unlu, A. and Aitchison, L. Gradient regularisation as approximate variational inference. In Symposium on Advances in Approximate Bayesian Inference, 2020.

Wang, Y., Zhang, G., and Ba, J. On solving minimax optimization locally: A follow-the-ridge approach. In International Conference on Learning Representations, 2019.

Wibisono, A., Wilson, A. C., and Jordan, M. I. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.

You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J., and Keutzer, K. ImageNet training in minutes. In International Conference on Parallel Processing, pp. 1–10, 2018.