On the Suboptimality of Negative Momentum for Minimax Optimization
Guodong Zhang (University of Toronto, Vector Institute)    Yuanhao Wang (Princeton University)

Abstract

Smooth game optimization has recently attracted great interest in machine learning, as it generalizes the single-objective optimization paradigm. However, game dynamics are more complex due to the interaction between different players, and are therefore fundamentally different from minimization, posing new challenges for algorithm design. Notably, it has been shown that negative momentum is preferred due to its ability to reduce oscillation in game dynamics. Nevertheless, the convergence rate of negative momentum was established only for simple bilinear games. In this paper, we extend the analysis to smooth and strongly-convex strongly-concave minimax games by taking the variational inequality formulation. By connecting Polyak's momentum with Chebyshev polynomials, we show that negative momentum accelerates the convergence of game dynamics locally, though with a suboptimal rate. To the best of our knowledge, this is the first work that provides an explicit convergence rate for negative momentum in this setting.

1 Introduction

Due to the increasing popularity of generative adversarial networks (Goodfellow et al., 2014; Radford et al., 2015; Arjovsky et al., 2017), adversarial training (Madry et al., 2018) and primal-dual reinforcement learning (Du et al., 2017; Dai et al., 2018), minimax optimization (or, more generally, game optimization) has gained significant attention as it offers a flexible paradigm that goes beyond ordinary loss-function minimization. In particular, our problem of interest is the following minimax optimization problem:

    min_{x∈X} max_{y∈Y} f(x, y).  (1)

We are usually interested in finding a Nash equilibrium (Von Neumann and Morgenstern, 1944): a set of parameters from which no player can (locally and unilaterally) improve its objective function. Though the dynamics of gradient-based methods are well understood for minimization problems, new issues and challenges appear in minimax games. For example, the naïve extension of gradient descent can fail to converge (Letcher et al., 2019; Mescheder et al., 2017) or converge to undesirable stationary points (Mazumdar et al., 2019; Adolphs et al., 2019; Wang et al., 2019).

Another important difference between minimax games and minimization problems is that a negative momentum value is preferred for improving convergence (Gidel et al., 2019b). To be specific, for the bilinear case f(x, y) = x⊤Ay, negative momentum with alternating updates converges to an ε-optimal solution with an iteration complexity of O(κ), where the condition number is defined as κ = λ_max(A⊤A)/λ_min(A⊤A), whereas Gradient Descent Ascent (GDA) fails to converge. Moreover, the rate of negative momentum matches the optimal rate of Extra-gradient (EG) (Korpelevich, 1976) and Optimistic Gradient Descent Ascent (OGDA) (Daskalakis et al., 2018; Mertikopoulos et al., 2019). A natural question to ask then is:

    Does negative momentum improve on GDA for other settings?

In this paper, we extend the analysis of negative momentum¹ to the strongly-convex strongly-concave setting and answer the above question in the affirmative. In particular, we observe that momentum methods (Polyak, 1964), whether positive or negative, can be connected to Chebyshev iteration (Manteuffel, 1977) for solving linear systems, which enables us to derive the optimal momentum parameter and the asymptotic convergence rate.

¹Throughout the paper, negative momentum refers to gradient descent-ascent with negative momentum.

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).
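The contrast between GDA and alternating negative momentum on a bilinear game is easy to reproduce numerically. The following is a minimal sketch (our own toy example, not the paper's experiment): A is a fixed diagonal matrix so the dynamics are easy to reason about, and the step size η = 0.3 and momentum β = −1/2 are illustrative choices rather than the tuned parameters from the literature.

```python
import numpy as np

# Bilinear game f(x, y) = x^T A y with a fixed, well-conditioned A.
A = np.diag([1.0, 2.0])
eta, beta = 0.3, -0.5          # illustrative step size and negative momentum

x0 = np.ones(2)
y0 = np.ones(2)
init = np.linalg.norm(np.concatenate([x0, y0]))

def gda(steps):
    """Simultaneous gradient descent-ascent: spirals away from (0, 0)."""
    x, y = x0.copy(), y0.copy()
    for _ in range(steps):
        # Both players update using the current (x, y).
        x, y = x - eta * (A @ y), y + eta * (A.T @ x)
    return np.linalg.norm(np.concatenate([x, y]))

def negative_momentum(steps):
    """Alternating GDA with negative (Polyak) momentum on both players."""
    x, y = x0.copy(), y0.copy()
    dx, dy = np.zeros(2), np.zeros(2)
    for _ in range(steps):
        dx = -eta * (A @ y) + beta * dx
        x = x + dx                        # x-player moves first
        dy = eta * (A.T @ x) + beta * dy  # y-player sees the updated x
        y = y + dy
    return np.linalg.norm(np.concatenate([x, y]))

gda_dist = gda(200)
nm_dist = negative_momentum(2000)
print(f"initial distance to (0, 0):         {init:.2e}")
print(f"simultaneous GDA after 200 steps:   {gda_dist:.2e}")   # grows
print(f"negative momentum after 2000 steps: {nm_dist:.2e}")    # shrinks
```

Here the unique equilibrium is (0, 0): simultaneous GDA expands the distance to it at every step, while the alternating updates with negative momentum contract toward it.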
With optimally tuned parameters, negative momentum achieves an acceleration locally, with an improved iteration complexity of O(κ^1.5) as opposed to the O(κ²) complexity of Gradient Descent Ascent (GDA). Following on that, we further ask:

    Is negative momentum optimal in the same setting? Does it match the iteration complexity of EG and OGDA again?

We answer these questions in the negative. In particular, our analysis implies that the iteration complexity lower bound for negative momentum is Ω(κ^1.5). Nevertheless, the optimal iteration complexity for this family of problems under a first-order oracle is Ω(κ) (Ibrahim et al., 2019; Zhang et al., 2019), which can be achieved by EG and OGDA. Therefore, we show for the first time that negative momentum alone is suboptimal for strongly-convex strongly-concave minimax games. To the best of our knowledge, this is the first work that provides an explicit convergence rate for negative momentum in this setting.

Organization. In Section 2, we define our notation and formulate minimax optimization as a variational inequality problem. Under the variational inequality framework, we further write first-order methods as discrete dynamical systems and show that we can safely linearize the dynamics for proving local convergence rates (thus simplifying the problem to that of solving linear systems). In Section 3, we discuss the connection between first-order methods and polynomial approximation and show that we can analyze the convergence of a first-order method through the sequence of polynomials it defines. In Section 4, we prove the local convergence rate of negative momentum for minimax games by connecting it with Chebyshev polynomials, showing that it has a suboptimal rate locally. Finally, in Section 6, we validate our claims in simulation.

2 Preliminaries

Notation. In this paper, scalars are denoted by lower-case letters (e.g., λ), vectors by lower-case bold letters (e.g., z), matrices by upper-case bold letters (e.g., J), and operators by upper-case letters (e.g., F). The superscript ⊤ represents the transpose of a vector or matrix. The spectrum of a square matrix A is denoted by Sp(A), and its eigenvalues by λ. We use ℜ and ℑ to denote the real and imaginary parts of a complex scalar, respectively, and ℝ and ℂ to denote the sets of real and complex numbers, respectively. We use ρ(A) = lim_{n→∞} ‖Aⁿ‖^{1/n} to denote the spectral radius of a matrix A. O, Ω and Θ are standard asymptotic notations. We use Π_t to denote the set of real polynomials of degree no more than t.

2.1 Variational Inequality Formulation of Minimax Optimization

We begin by presenting the basic variational inequality framework that we will consider throughout the paper. To that end, let Z be a nonempty convex subset of ℝ^d, and let F : ℝ^d → ℝ^d be a continuous mapping on ℝ^d. In its most general form, the variational inequality (VI) problem (Harker and Pang, 1990) associated to F and Z can be stated as:

    find z* ∈ Z  s.t.  F(z*)⊤(z − z*) ≥ 0  ∀z ∈ Z.  (2)

In the case of Z = ℝ^d, the problem reduces to finding z* such that F(z*) = 0. To provide some intuition about the variational inequality problem, we discuss two important examples below.

Example 1 (Minimization). Suppose that F = ∇f for a smooth function f on ℝ^d; then the variational inequality problem amounts to finding the critical points of f. In the case where f is convex, any solution of (2) is a global minimum.

Example 2 (Minimax Optimization). Consider the convex-concave minimax optimization (or saddle-point optimization) problem, where the objective is to solve

    min_x max_y f(x, y),  where f is a smooth function.  (3)

One can show that this is a special case of (2) with F(z) = [∇_x f(x, y)⊤, −∇_y f(x, y)⊤]⊤.

Notably, the vector field F in Example 2 is not necessarily conservative, i.e., it might not be the gradient of any function. In addition, since f in the minimax problem happens to be convex-concave, any solution z* = [x*⊤, y*⊤]⊤ of (2) is a global Nash equilibrium (Von Neumann and Morgenstern, 1944), i.e., f(x*, y) ≤ f(x*, y*) ≤ f(x, y*) for all x, y ∈ ℝ^d.

In this work, we are particularly interested in the case of f being a strongly-convex strongly-concave and smooth function, which essentially assumes that the vector field F is strongly-monotone and Lipschitz (see Fallah et al. (2020, Lemma 2.6)). Here we state our assumptions formally.

Assumption 1 (Strongly Monotone). The vector field F is µ-strongly-monotone:

    (F(z₁) − F(z₂))⊤(z₁ − z₂) ≥ µ‖z₁ − z₂‖₂²,  ∀z₁, z₂ ∈ ℝ^d.  (4)

Assumption 2 (Lipschitz). The vector field F is L-Lipschitz if the following holds:

    ‖F(z₁) − F(z₂)‖₂ ≤ L‖z₁ − z₂‖₂,  ∀z₁, z₂ ∈ ℝ^d.  (5)

In the context of variational inequalities, Lipschitzness and (strong) monotonicity are fairly standard and have been used in many classical works (Tseng, 1995; Chen and Rockafellar, 1997; Nesterov, 2007; Nemirovski, 2004). With these two assumptions in hand, we define the condition number κ ≜ L/µ, which measures the hardness of the problem. In the following, we turn to suitable optimization techniques for the variational inequality problem.

Table 1: First-order algorithms for smooth and strongly-monotone games.

    Method   Parameter Choice   Complexity   Reference
    GDA      α = 0, β = 0       O(κ²)        Ryu and Boyd (2016); Azizian et al. (2020a)
    OGDA     α = 1, β = 0       O(κ)         Gidel et al. (2019a); Mokhtari et al. (2020)
    NM       α = 0, β < 0       Θ(κ^1.5)     This paper (Theorem 2)

convergence rate for any algorithm of the form (6):

    ‖z_k − z*‖₂ ≥ ρ_opt^k ‖z₀ − z*‖₂,  with ρ_opt = 1 − 2µ/(µ + L).  (9)

2.3 Dynamical System Viewpoint and Local
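Assumptions 1 and 2 are easy to check concretely. Below is a small sketch (our own example, not from the paper) using the quadratic game f(x, y) = (a/2)‖x‖² + x⊤By − (c/2)‖y‖², whose vector field F from Example 2 is linear: µ is the smallest eigenvalue of the symmetric part of the Jacobian of F, and L is its largest singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
a, c = 1.0, 2.0                  # strong convexity in x / strong concavity in y
B = rng.standard_normal((n, n))  # arbitrary coupling between the players

# f(x, y) = (a/2)||x||^2 + x^T B y - (c/2)||y||^2
def F(z):
    """Vector field F(z) = [grad_x f; -grad_y f] from Example 2."""
    x, y = z[:n], z[n:]
    return np.concatenate([a * x + B @ y, c * y - B.T @ x])

# F is linear, F(z) = J z; mu and L can be read off J directly.
J = np.block([[a * np.eye(n), B], [-B.T, c * np.eye(n)]])
S = (J + J.T) / 2                       # symmetric part: blockdiag(a I, c I)
mu = np.linalg.eigvalsh(S).min()        # strong monotonicity constant = min(a, c)
L = np.linalg.svd(J, compute_uv=False).max()  # Lipschitz constant
kappa = L / mu
print(f"mu = {mu:.3f}, L = {L:.3f}, condition number kappa = {kappa:.3f}")

# Empirically verify (4) and (5) on random pairs (z1, z2).
for _ in range(100):
    z1, z2 = rng.standard_normal(2 * n), rng.standard_normal(2 * n)
    d, Fd = z1 - z2, F(z1) - F(z2)
    assert Fd @ d >= mu * (d @ d) - 1e-9                       # (4)
    assert np.linalg.norm(Fd) <= L * np.linalg.norm(d) + 1e-9  # (5)
```

Note that the antisymmetric coupling blocks B and −B⊤ cancel in the symmetric part, so µ = min(a, c) regardless of B, while a strong coupling inflates L and hence κ.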