The Complexity of Nonconvex-Strongly-Concave Minimax Optimization

Siqi Zhang*1 Junchi Yang*1 Cristóbal Guzmán2 Negar Kiyavash3 Niao He4

1University of Illinois at Urbana-Champaign (UIUC) 2University of Twente & Pontificia Universidad Católica de Chile 3École polytechnique fédérale de Lausanne (EPFL) 4ETH Zürich

Abstract

This paper studies the complexity of finding approximate stationary points of nonconvex-strongly-concave (NC-SC) smooth minimax problems, in both general and averaged smooth finite-sum settings. We establish nontrivial lower complexity bounds of Ω(√κ ∆Lϵ^{-2}) and Ω(n + √(nκ) ∆Lϵ^{-2}) for the two settings, respectively, where κ is the condition number, L is the smoothness constant, and ∆ is the initial gap. Our result reveals substantial gaps between these limits and the best-known upper bounds in the literature. To close these gaps, we introduce a generic acceleration scheme that deploys existing gradient-based methods to solve a sequence of crafted strongly-convex-strongly-concave subproblems. In the general setting, the complexity of our proposed algorithm nearly matches the lower bound; in particular, it removes an additional poly-logarithmic dependence on accuracy present in previous works. In the averaged smooth finite-sum setting, our proposed algorithm improves over previous algorithms by providing a nearly-tight dependence on the condition number.

*Equal contribution.

Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021).

1 INTRODUCTION

In this paper, we consider general minimax problems of the form (n, d_1, d_2 ∈ ℕ^+):

    min_{x ∈ ℝ^{d_1}} max_{y ∈ ℝ^{d_2}} f(x, y),    (1)

as well as their finite-sum counterpart:

    min_{x ∈ ℝ^{d_1}} max_{y ∈ ℝ^{d_2}} f(x, y) ≜ (1/n) Σ_{i=1}^n f_i(x, y),    (2)

where f, f_i are continuously differentiable and f is L-Lipschitz smooth jointly in x and y. We focus on the setting where f is µ-strongly concave in y but possibly nonconvex in x, i.e., f is nonconvex-strongly-concave (NC-SC). Such problems arise ubiquitously in machine learning, e.g., GANs with regularization [Sanjabi et al., 2018, Lei et al., 2020], Wasserstein robust models [Sinha et al., 2018], robust learning over multiple domains [Qian et al., 2019], and off-policy reinforcement learning [Dai et al., 2017, 2018, Huang and Jiang, 2020]. Since the problem is nonconvex in general, a natural goal is to find an approximate stationary point x̄ such that ∥∇Φ(x̄)∥ ≤ ϵ for a given accuracy ϵ, where Φ(x) ≜ max_y f(x, y) is the primal function. This goal is meaningful for the aforementioned applications; e.g., in adversarial models the primal function quantifies the worst-case loss for the learner with respect to the adversary's actions.

There exist a number of algorithms for solving NC-SC problems in the general setting, including GDmax [Nouiehed et al., 2019], GDA [Lin et al., 2020a], alternating GDA [Yang et al., 2020a, Boţ and Böhm, 2020, Xu et al., 2020], and Minimax-PPA [Lin et al., 2020b]. Specifically, GDA and its alternating variant both achieve a complexity of O(κ^2 ∆Lϵ^{-2}) [Lin et al., 2020a, Yang et al., 2020a], where κ ≜ L/µ is the condition number and ∆ ≜ Φ(x_0) − inf_x Φ(x) is the initial function gap. Recently, [Lin et al., 2020b] provided the best-known complexity of O(√κ ∆Lϵ^{-2} · log^2(κL/ϵ)), achieved by Minimax-PPA, which improves the dependence on the condition number but suffers from an extra poly-logarithmic factor in 1/ϵ.

In the finite-sum setting, several algorithms have been proposed recently, e.g., PGSMD [Rafique et al., 2018], SGDmax [Jin et al., 2020], Stochastic GDA [Lin et al., 2020a], and SREDA [Luo et al., 2020]. In particular, [Lin et al., 2020a] proved that Stochastic GDA attains a complexity of O(κ^3 ϵ^{-4}). [Luo et al., 2020] recently showed the state-of-the-art result achieved by SREDA: when n ≥ κ^2, the complexity is Õ(n log(κ/ϵ) + √n κ^2 ∆Lϵ^{-2}), which is sharper than the batch Minimax-PPA algorithm; when n ≤ κ^2, the complexity is O(nκ + κ^2 ∆Lϵ^{-2}), which is sharper than Stochastic GDA.

Despite this active line of research, whether these state-of-the-art complexity bounds can be further improved remains elusive. As a special case obtained by restricting the domain of y to a singleton, lower bounds for nonconvex smooth minimization, e.g., [Carmon et al., 2019a,b, Fang et al., 2018, Zhou and Gu, 2019, Arjevani et al., 2019], hardly capture the dependence on the condition number κ, which plays a crucial role in the complexity of general NC-SC smooth minimax problems. In many of the aforementioned machine learning applications, the condition number is inversely proportional to the regularization parameter and can be very large in practice. For example, in statistical learning, where n represents the sample size, the optimal regularization parameter (i.e., the one with the optimal empirical/generalization trade-off) leads to κ = Ω(√n) [Shalev-Shwartz and Ben-David, 2014]. This motivates the following fundamental questions: What is the complexity limit for NC-SC problems in the general and finite-sum settings? Can we design new algorithms that meet these performance limits and attain optimal dependence on the condition number?

1.1 CONTRIBUTIONS

Our contributions, summarized in Table 1, are as follows:

• We establish nontrivial lower complexity bounds for finding an approximate stationary point of nonconvex-strongly-concave (NC-SC) minimax problems. In the general setting, we prove an Ω(√κ ∆Lϵ^{-2}) lower complexity bound, which applies to arbitrary deterministic linear-span algorithms interacting with the classical first-order oracle. In the finite-sum setting, we prove an Ω(n + √(nκ) ∆Lϵ^{-2}) lower complexity bound (when κ = Ω(n))^1 for the class of averaged smooth functions and arbitrary linear-span algorithms interacting with a (randomized) incremental first-order oracle (precise definitions in Sections 2 and 3).

Our lower bounds build upon two main ideas: first, we start from an NC-SC function whose primal function mimics the lower bound construction in smooth nonconvex minimization [Carmon et al., 2019a]. Crucially, the smoothness parameter of this primal function is boosted by an Ω(κ) factor, which strengthens the lower bound. Second, the function has an alternating zero-chain structure, as utilized in lower bounds for convex-concave settings [Ouyang and Xu, 2019]. The combination of these features leads to a hard instance for our problem.

• To bridge the gap between the lower bounds and existing upper bounds in both settings, we introduce a generic Catalyst acceleration framework for NC-SC minimax problems, inspired by [Lin et al., 2018a, Yang et al., 2020b], which applies existing gradient-based methods to solve a sequence of crafted strongly-convex-strongly-concave (SC-SC) minimax subproblems. When combined with the extragradient method, the resulting algorithm achieves an Õ(√κ ∆Lϵ^{-2}) complexity in terms of gradient evaluations, which tightly matches the lower bound in the general setting (up to logarithmic terms in constants) and shaves off the extra poly-logarithmic term in 1/ϵ required by the state-of-the-art [Lin et al., 2020b]. When combined with a stochastic variance-reduced method, the resulting algorithm achieves an overall Õ((n + n^{3/4}√κ) ∆Lϵ^{-2}) complexity for averaged smooth finite-sum problems, which has a nearly-tight dependence on the condition number and improves on the best-known upper bound when n ≤ κ^4.

1.2 RELATED WORK

Lower bounds for minimax problems. Information-based complexity (IBC) theory [Traub et al., 1988], which derives the minimal number of oracle calls needed to attain an approximate solution with a desired accuracy, is often used in the lower bound analysis of optimization algorithms. Unlike the case of minimization, e.g., [Nemirovski and Yudin, 1983, Carmon et al., 2019a,b, Arjevani et al., 2019], lower bounds for minimax optimization are far less understood; only a few recent works provided lower bounds for finding an approximate saddle point of (strongly)-convex-(strongly)-concave minimax problems [Ouyang and Xu, 2019, Zhang et al., 2019, Ibrahim et al., 2020, Xie et al., 2020, Yoon and Ryu, 2021]. Instead, this paper considers lower bounds for the NC-SC problem of finding a stationary point, which requires different techniques for constructing zero-chain properties. Note that there exists another line of research on the purely stochastic setting, e.g., [Rafique et al., 2018, Luo et al., 2020]; constructing lower bounds in that setting is out of the scope of this paper.

Complexity of making gradient small. In nonconvex optimization, most lower and upper complexity bound results are presented in terms of the gradient norm (see a recent survey [Danilova et al., 2020] and references therein for more details). For convex optimization, the optimality gap based on the objective value is commonly used as the convergence criterion. Convergence in terms of the gradient norm, albeit easier to check, was far less studied in the literature until recently; see, e.g., [Nesterov, 2012, Allen-Zhu, 2018, Diakonikolas and Guzmán, 2021] for convex minimization and [Diakonikolas, 2020, Yoon and Ryu, 2021] for convex-concave smooth minimax problems.
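To make the gradient-norm convergence criterion ∥∇Φ(x̄)∥ ≤ ϵ concrete, here is a minimal two-time-scale GDA run on a toy NC-SC instance. The objective, step sizes, and iteration count are our own illustrative choices; this is a sketch, not any of the algorithms analyzed in the paper.

```python
import math

# Toy NC-SC instance (our own choice, not the paper's hard instance):
# f(x, y) = cos(x) + x*y - (mu/2)*y^2, nonconvex in x and mu-strongly
# concave in y. Its primal function is Phi(x) = cos(x) + x^2/(2*mu),
# since the inner maximizer is y*(x) = x/mu.
mu = 0.5

def grad(x, y):
    return (-math.sin(x) + y, x - mu * y)  # (df/dx, df/dy)

def grad_primal(x):
    return -math.sin(x) + x / mu  # Phi'(x)

def gda(x, y, eta_x=0.02, eta_y=1.0, iters=3000):
    # Simultaneous GDA with a much smaller primal step than dual step,
    # in the spirit of two-time-scale GDA analyses.
    for _ in range(iters):
        gx, gy = grad(x, y)
        x, y = x - eta_x * gx, y + eta_y * gy
    return x, y

x, y = gda(1.0, 0.0)
print(abs(grad_primal(x)))  # small: x is an eps-stationary point of Phi
```

With a balanced single-time-scale stepsize the bilinear coupling makes these iterates spiral; the small eta_x / large eta_y split is what stabilizes the run, which is the practical reason the condition number enters GDA-type complexities.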

^1 A concurrent work by Han et al. [2021] provided a similar lower bound result for finite-sum NC-SC problems under probabilistic arguments based on geometric random variables.

Table 1: Upper and lower complexity bounds for finding an approximate stationary point. Here Õ(·) hides a poly-logarithmic factor in L, µ and κ. L: Lipschitz smoothness parameter; µ: strong concavity parameter; κ: condition number L/µ; ∆: initial gap of the primal function.

Setting         | Our Lower Bound                          | Our Upper Bound                           | Previous Upper Bound
NC-SC, general  | Ω(√κ ∆Lϵ^{-2})  (Theorem 3.1)            | Õ(√κ ∆Lϵ^{-2})  (Section 4.2)             | O(κ^2 ∆Lϵ^{-2}) [Lin et al., 2020a]; Õ(√κ ∆Lϵ^{-2} log^2(1/ϵ)) [Lin et al., 2020b]
NC-SC, FS, AS^1 | Ω(n + √(nκ) ∆Lϵ^{-2})  (Theorem 3.2)     | Õ((n + n^{3/4}√κ) ∆Lϵ^{-2})  (Section 4.2) | Õ(n + √n κ^2 ∆Lϵ^{-2}) if n ≥ κ^2; O(nκ + κ^2 ∆Lϵ^{-2}) if n ≤ κ^2 [Luo et al., 2020]

^1 FS: finite-sum, AS: averaged smooth; see Section 2 for definitions.

Nonconvex minimax optimization. In the NC-SC setting, as mentioned, there have been several substantial works. Among them, Lin et al. [2020b] achieved the best dependency on the condition number by combining the proximal point algorithm with accelerated gradient descent. Luo et al. [2020] introduced a variance reduction algorithm, SREDA. Guo et al. [2020] provided algorithms for the NC-SC minimax formulation of AUC maximization problems under the additional assumption that the primal function satisfies the Polyak-Łojasiewicz condition. In addition, nonconvex-concave minimax optimization, i.e., where the function f is only concave in y, is extensively explored in [Zhang et al., 2020, Ostrovskii et al., 2020, Thekumparampil et al., 2019, Zhao, 2020, Nouiehed et al., 2019, Yang et al., 2020b]. Recently, [Daskalakis et al., 2020] showed that for general smooth nonconvex-nonconcave objectives, finding approximate first-order locally optimal solutions is intractable. Therefore, another line of research is devoted to searching for solutions under additional structural properties, e.g., [Yang et al., 2020c,a, Mertikopoulos et al., 2019, Diakonikolas et al., 2020, Lin et al., 2018b].

Catalyst acceleration. The Catalyst framework was initially studied in [Lin et al., 2015] for convex minimization and extended to nonconvex minimization in [Paquette et al., 2018] to obtain accelerated algorithms. A similar idea to accelerate SVRG appeared in [Frostig et al., 2015]. These works are rooted in the proximal point algorithm (PPA) [Rockafellar, 1976, Güler, 1991] and inexact accelerated PPA [Güler, 1992]. Recently, [Yang et al., 2020b] generalized the idea and obtained state-of-the-art results for solving strongly-convex-concave and nonconvex-concave minimax problems. In contrast, this paper introduces a new Catalyst acceleration scheme for the nonconvex-strongly-concave setting, which relies on different parameter settings and a different stopping criterion.

2 PRELIMINARIES

Notations. Throughout the paper, we use dom F for the domain of a function F, ∇F = (∇_x F, ∇_y F) for the full gradient, and ∥·∥ for the ℓ_2-norm. We use 0 to represent zero vectors or scalars and e_i to represent the unit vector whose i-th element is 1. For nonnegative functions f(x) and g(x), we say f = O(g) if f(x) ≤ c g(x) for some c > 0, and further write f = Õ(g) to omit poly-logarithmic terms in the constants L, µ and κ, while f = Ω(g) if f(x) ≥ c g(x) (see more in Appendix A).

We introduce the definitions and assumptions used throughout.

Definition 2.1 (Primal and Dual Functions) For a function f(x, y), we define Φ(x) ≜ max_y f(x, y) as the primal function and Ψ(y) ≜ min_x f(x, y) as the dual function. We also define the primal-dual gap at a point (x̄, ȳ) as gap_f(x̄, ȳ) ≜ max_{y ∈ ℝ^{d_2}} f(x̄, y) − min_{x ∈ ℝ^{d_1}} f(x, ȳ).

Definition 2.2 (Lipschitz Smoothness) We say a function f(x, y) is L-Lipschitz smooth (L-S) jointly in x and y if it is differentiable and, for any (x_1, y_1), (x_2, y_2) ∈ ℝ^{d_1} × ℝ^{d_2},

    ∥∇_x f(x_1, y_1) − ∇_x f(x_2, y_2)∥ ≤ L(∥x_1 − x_2∥ + ∥y_1 − y_2∥)  and
    ∥∇_y f(x_1, y_1) − ∇_y f(x_2, y_2)∥ ≤ L(∥x_1 − x_2∥ + ∥y_1 − y_2∥),

for some L > 0.

Definition 2.3 (Average / Individual Smoothness) We say f(x, y) = (1/n) Σ_{i=1}^n f_i(x, y), or {f_i}_{i=1}^n, is L-averaged smooth (L-AS) if each f_i is differentiable and, for any (x_1, y_1), (x_2, y_2) ∈ ℝ^{d_1} × ℝ^{d_2}, we have

    (1/n) Σ_{i=1}^n ∥∇f_i(x_1, y_1) − ∇f_i(x_2, y_2)∥^2 ≤ L^2 (∥x_1 − x_2∥^2 + ∥y_1 − y_2∥^2).

We say f or {f_i}_{i=1}^n is L-individually smooth (L-IS) if each f_i is L-Lipschitz smooth.

Average smoothness is a weaker condition than the common Lipschitz smoothness assumption on each component in finite-sum / stochastic minimization [Fang et al., 2018, Zhou and Gu, 2019]. Similarly, in minimax problems, the following proposition summarizes the relationship among these different notions of smoothness.

Proposition 2.1 Let f(x, y) = (1/n) Σ_{i=1}^n f_i(x, y). Then we have: (a) If f is L-IS or L-AS, then it is L-S. (b) If f is L-IS, then it is (2L)-AS. (c) If f is L-AS, then f(x, y) + (τ_x/2)∥x − x̃∥^2 − (τ_y/2)∥y − ỹ∥^2 is √2(L + max{τ_x, τ_y})-AS for any x̃ and ỹ.

Definition 2.4 (Strong Convexity) A differentiable function g: ℝ^{d_1} → ℝ is convex if g(x_2) ≥ g(x_1) + ⟨∇g(x_1), x_2 − x_1⟩ for any x_1, x_2 ∈ ℝ^{d_1}. Given µ ≥ 0, we say g is µ-strongly convex if g(x) − (µ/2)∥x∥^2 is convex, and g is µ-strongly concave if −g is µ-strongly convex.

Assumption 2.1 (Main Settings) We assume that f(x, y) in (1) is a nonconvex-strongly-concave (NC-SC) function such that f is L-S and f(x, ·) is µ-strongly concave for any fixed x ∈ ℝ^{d_1}; for the finite-sum case (2), we further assume that {f_i}_{i=1}^n is L-AS. We also assume that the initial primal suboptimality is bounded: Φ(x_0) − inf_x Φ(x) ≤ ∆.

Under Assumption 2.1, the primal function Φ(·) is differentiable and 2κL-Lipschitz smooth [Lin et al., 2020b, Lemma 23], where κ ≜ L/µ. Throughout this paper, we use the stationarity of Φ(·) as the convergence criterion.

Definition 2.5 (Convergence Criterion) For a differentiable function Φ, a point x̄ ∈ dom Φ is called an ϵ-stationary point of Φ if ∥∇Φ(x̄)∥ ≤ ϵ.

Another commonly used criterion is the stationarity of f, i.e., ∥∇_x f(x̄, ȳ)∥ ≤ ϵ and ∥∇_y f(x̄, ȳ)∥ ≤ ϵ. This is a weaker convergence criterion; we refer readers to [Lin et al., 2020a, Section 4.3] for a comparison of the two criteria.

3 LOWER BOUNDS FOR NC-SC MINIMAX PROBLEMS

In this section, we establish lower complexity bounds (LB) for finding approximate stationary points of NC-SC minimax problems, in both the general and finite-sum settings. We first present the basic components of the oracle complexity framework [Nemirovski and Yudin, 1983] and then proceed to the details of each case. For simplicity, in this section only, we denote x_d as the d-th coordinate of x and x^t as the variable x at the t-th iteration.

• For the finite-sum setting, the incremental first-order oracle (IFO), often used in lower bound analysis [Agarwal and Bottou, 2015], is such that for each query at a point z with a given index i ∈ [n], it returns the gradient of the i-th component of f(x, y) = (1/n) Σ_{i=1}^n f_i(x, y), i.e., O_IFO(f, z, i) ≜ (∇_x f_i(x, y), ∇_y f_i(x, y)). Here we consider the averaged smooth IFO and the individually smooth IFO, denoted O^{L,AS}_IFO(f) and O^{L,IS}_IFO(f), where {f_i}_{i=1}^n is L-AS or L-IS, respectively.

Algorithm class. In this work, we consider the class of linear-span algorithms interacting with an oracle O, denoted A(O). These algorithms satisfy the following property: if we let (z^t)_t, with z^t = (x^t, y^t), be the sequence of queries made by the algorithm, then for each t we have

    z^{t+1} ∈ Span{z^0, ..., z^t; O(f, z^0), ..., O(f, z^t)}.    (3)

For the finite-sum case, the above protocol fits many existing deterministic and randomized linear-span algorithms. We distinguish the general and finite-sum settings by specifying the oracle used, O_FO or O_IFO, respectively. Most existing first-order algorithms, including simultaneous and alternating update algorithms, can be formulated as linear-span algorithms. It is worth pointing out that the linear-span assumption is typically made without loss of generality, since there is a standard reduction from deterministic linear-span algorithms to arbitrary oracle-based deterministic algorithms [Nemirovsky, 1991, 1992, Ouyang and Xu, 2019]. We defer this extension to future work.

Complexity measures. The efficiency of an algorithm is quantified by the oracle complexity [Nemirovski and Yudin, 1983] of finding an ϵ-stationary point of the primal function: for an algorithm A ∈ A(O) interacting with an oracle O and an instance f ∈ F, we define

    T_ϵ(f, A) ≜ inf{T ∈ ℕ : ∥∇Φ(x^T)∥ ≤ ϵ}    (4)

as the minimum number of oracle calls A makes to reach stationarity convergence.
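The oracle-call count T_ϵ(f, A) in (4) can be illustrated in a few lines: wrap a first-order oracle with a counter and record how many calls a method needs before the gradient norm first drops below ϵ. The quadratic instance and the gradient-descent "algorithm" below are hypothetical stand-ins of our own, not objects from the paper.

```python
# Toy illustration of the oracle-complexity measure T_eps(f, A):
# count first-order oracle calls until the query point is eps-stationary.
# Here Phi(x) = x^2 plays the role of the primal function, and plain
# gradient descent plays the role of the linear-span algorithm A.

calls = 0

def oracle(x):
    # First-order oracle O_FO: returns the gradient and counts the call.
    global calls
    calls += 1
    return 2.0 * x  # gradient of Phi(x) = x^2

def T_eps(eps, x=1.0, eta=0.25):
    global calls
    calls = 0
    g = oracle(x)
    while abs(g) > eps:   # stationarity check |Phi'(x^t)| <= eps
        x = x - eta * g   # linear-span update: x^{t+1} in Span{x^t, g^t}
        g = oracle(x)
    return calls          # number of oracle calls used to reach eps

print(T_eps(1e-3))  # -> 12
```

Since here the gradient halves at every step, the count grows like log(1/ϵ); the lower bounds in this section show that for the NC-SC classes the count must instead grow polynomially in 1/ϵ and κ.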
For the general case, we define the worst-case complexity

    Compl_ϵ(F, A, O) ≜ sup_{f ∈ F} inf_{A ∈ A(O)} T_ϵ(f, A).    (5)

For finite-sum cases, we lower bound the randomized complexity by the distributional complexity [Braun et al., 2017]:

    Compl_ϵ(F, A, O) ≜ sup_{f ∈ F} inf_{A ∈ A(O)} E[T_ϵ(f, A)].    (6)

3.1 FRAMEWORK AND SETUP

We study the lower bound for finding a primal stationary point under the well-known oracle complexity framework [Nemirovski and Yudin, 1983]; we first present the basics of the framework.

Function class. We consider the nonconvex-strongly-concave (NC-SC) function class defined in Assumption 2.1, with parameters L, µ, ∆ > 0, denoted by F^{L,µ,∆}_{NCSC}.

Oracle class. We consider different oracles for the general and finite-sum settings. Define z ≜ (x, y).

• For the general setting, we consider the first-order oracle (FO), denoted O_FO(f, ·), which for each query at a point z returns the gradient O_FO(f, z) ≜ (∇_x f(x, y), ∇_y f(x, y)).

Following the motivation discussed in Section 1.1, our analysis uses the zero-chain argument. First we define the notions of a (first-order) zero-chain [Carmon et al., 2019b] and of activation.

Definition 3.1 (Zero Chain, Activation) A function f: ℝ^d → ℝ is a first-order zero-chain if for any x ∈ ℝ^d,

    supp{x} ⊆ {1, ..., i − 1}  ⇒  supp{∇f(x)} ⊆ {1, ..., i},

where supp{x} ≜ {i ∈ [d] | x_i ≠ 0} and [d] = {1, ..., d}. For an algorithm initialized at 0 ∈ ℝ^d, with iterates {x^t}_t, we say coordinate i is activated at x^t if x_i^t ≠ 0 and x_i^s = 0 for all s < t.

3.2 GENERAL NC-SC PROBLEMS

First we consider general NC-SC (Gen-NC-SC) minimax optimization problems. Following the above framework, we choose the function class F^{L,µ,∆}_{NCSC}, the oracle O_FO, and the linear-span algorithm class A, and we analyze the complexity defined in (5).

Hard instance construction. Inspired by the hard instances constructed in [Ouyang and Xu, 2019, Carmon et al., 2019b], we introduce the following function F_d: ℝ^{d+1} × ℝ^{d+2} → ℝ (d ∈ ℕ^+):

    F_d(x, y; λ, α) ≜ λ_1 ⟨B_d x, y⟩ − λ_2 ∥y∥^2 − (λ_1^2 √α)/(2λ_2) ⟨e_1, x⟩
                     + (λ_1^2 α)/(2λ_2) Σ_{i=1}^d Γ(x_i) − (λ_1^2 α)/(4λ_2) x_{d+1}^2 + (λ_1^2 √α)/(4λ_2),    (7)

where λ = (λ_1, λ_2) ∈ ℝ^2 is the parameter vector, e_1 ∈ ℝ^{d+1} is the unit vector whose only nonzero element is in the first dimension, and Γ: ℝ → ℝ and B_d ∈ ℝ^{(d+2)×(d+1)} are

    B_d = [ 1                      ]
          [ 1   −1                 ]
          [     ⋱    ⋱            ]
          [          1   −1        ]           Γ(x) = 120 ∫_1^x t^2 (t − 1) / (1 + t^2) dt.    (8)
          [              1   −1    ]    ,
          [                  α^{1/4} ]

Matrix B_d essentially triggers the activation of variables at each iteration, and the function Γ introduces nonconvexity in x into the instance. By the first-order optimality condition of F_d(x, ·; λ, α), we can compute its primal function Φ_d:

    Φ_d(x; λ, α) ≜ max_{y ∈ ℝ^{d+2}} F_d(x, y; λ, α)
                 = (λ_1^2)/(2λ_2) [ (1/2) x^⊤ A_d x − √α x_1 + (√α)/2 + α Σ_{i=1}^d Γ(x_i) + ((1 − α)/2) x_{d+1}^2 ],    (9)

where A_d ∈ ℝ^{(d+1)×(d+1)} is

    A_d = B_d^⊤ B_d − e_{d+1} e_{d+1}^⊤ = [ 1 + √α  −1                     ]
                                          [ −1       2   −1                ]
                                          [          −1   2   ⋱           ]
                                          [               ⋱   ⋱   −1      ]    (10)
                                          [                   −1   2   −1  ]
                                          [                       −1   1   ] .

The resulting primal function resembles the worst-case functions used in the lower bound analysis of minimization problems [Nesterov, 2018, Carmon et al., 2019b].

Zero-Chain Construction. First we summarize key properties of the instance and its zero-chain mechanism. We further denote ê_i ∈ ℝ^{d+2} as the unit vector for the variable y and define (k ≥ 1)

    X_k ≜ Span{e_1, e_2, ..., e_k},  X_0 ≜ {0},
    Y_k ≜ Span{ê_{d+2}, ê_{d+1}, ..., ê_{d−k+2}},  Y_0 ≜ {0};    (11)

then we have the following properties of F_d.

Lemma 3.1 (Properties of F_d) For any d ∈ ℕ^+ and α ∈ [0, 1], F_d(x, y; λ, α) in (7) satisfies:

(i) The function F_d is L_F-Lipschitz smooth, where L_F = max{200λ_1^2 α/λ_2, 2λ_1, 2λ_2}.

(ii) For each fixed x ∈ ℝ^{d+1}, F_d(x, ·; λ, α) is µ_F-strongly concave, where µ_F = 2λ_2.

(iii) The following properties hold:
  a) x = y = 0 ⇒ ∇_x F_d ∈ X_1, ∇_y F_d = 0.
  b) x ∈ X_k, y ∈ Y_k ⇒ ∇_x F_d ∈ X_{k+1}, ∇_y F_d ∈ Y_k.
  c) x ∈ X_{k+1}, y ∈ Y_k ⇒ ∇_x F_d ∈ X_{k+1}, ∇_y F_d ∈ Y_{k+1}.

(iv) For L ≥ µ > 0, if λ = λ* = (λ_1*, λ_2*) = (L/2, µ/2) and α ≤ µ/(100L), then F_d is L-Lipschitz smooth and, for any fixed x ∈ ℝ^{d+1}, F_d(x, ·; λ, α) is µ-strongly concave.

The proof of Lemma 3.1 is deferred to Appendix C.1.1. The first two properties show that F_d is Lipschitz smooth and NC-SC; the third property shows that, starting from (x, y) = (0, 0), the activation process follows an "alternating zero-chain" pattern [Ouyang and Xu, 2019]. That is, for a linear-span algorithm, when x ∈ X_k and y ∈ Y_k, the next iterate can activate at most the (k+1)-th coordinate of x while keeping y fixed; similarly, when x ∈ X_{k+1} and y ∈ Y_k, the next iterate can activate at most the (d − k + 1)-th element of y. We need the following properties of Φ_d for the lower bound argument.

Lemma 3.2 (Properties of Φ_d) For any α ∈ [0, 1] and x ∈ ℝ^{d+1}, if x_d = x_{d+1} = 0, we have:

(i) ∥∇Φ_d(x; λ, α)∥ ≥ (λ_1^2)/(8λ_2) · α^{3/4}.

(ii) Φ_d(0; λ, α) − inf_{x ∈ ℝ^{d+1}} Φ_d(x; λ, α) ≤ (λ_1^2)/(2λ_2) · (√α/2 + 10αd).

We defer the proof of Lemma 3.2 to Appendix C.1.2. This lemma indicates that, starting from (x, y) = (0, 0) with appropriate parameter settings, the primal function Φ_d will not approach stationarity until the last two coordinates are activated. Now we are ready to present our final lower bound result for the general NC-SC case.
Theorem 3.1 (LB for Gen-NC-SC) For the linear-span where 0 , I ∈ d×d are the zero and identity matrices d d R first-order algorithm class A, parameters L, µ, ∆ > 0, and respectively, while the i-th matrix above is the identity ma-  √  2 ∆L ∆L κ trix. Hence, U(i)x will be the (id − d + 1)-th to the (id)-th accuracy ϵ satisfying ϵ ≤ min 6400 , 38400 , we have elements of x, similar property also holds for V(i)y.  L,µ,∆  √ −2 Complϵ FNCSC , A, OFO = Ω κ∆Lϵ . (12) The hard instance construction here follows the idea of that in the deterministic hard instance (7), the basic motivation The hard instance in the proof is established based on is that its primal function will be a finite-sum form of the F in (7). We choose the scaled function f(x, y) = d primal function Φ defined in the deterministic case (9). We η2F ( x , y ; λ∗, α) as the final hard instance, which pre- d d η η choose the following functions H : d+1 × d+2 → , serves the smoothness and strong convexity (by Lemma d R R R Γn : n(d+1) → and B.3), while appropriate setting of η will help to fulfill the d R R requirements on the initial gap and large gradient norm 2√ 2 λ1 α (before thorough activation) of the primal function. The de- Hd(x, y; λ, α) ≜ λ1⟨Bdx, y⟩ − λ2∥y∥ − ⟨e1, x⟩ 2λ2 tailed statement and proof of Theorem 3.1 are presented in 2 2√ λ1α 2 λ1 α Appendix C.1.3. − xd+1 + , 4λ2 4λ2 (14) n i(d+1)−1 n X X Remark 3.1 (Tightness) The best-known upper bounds for Γd (x) ≜ Γ(xj ), 2 −2 general NC-SC problems are O(∆Lκ ϵ√ ) [Lin et al., i=1 j=i(d+1)−d ˜ −2 2 1  2020a, Bo¸t and Böhm, 2020] and O ∆ κLϵ log ϵ ¯ ¯ n(d+1) n(d+2) (i) n [Lin et al., 2020b]. Therefore, our result exhibits significant then fi, f : R × R → R, {U }i=1 ∈ (i) n gaps in terms of the dependence on ϵ and κ. In order to Orth(d+1, n(d+1), n), {V }i=1 ∈ Orth(d+2, n(d+ mitigate these gaps, we propose faster algorithms in Section 2), n) and 4. 
On the other hand, compared to the Ω(∆Lϵ−2) lower 2 bound for nonconvex smooth minimization [Carmon et al., ¯  (i) (i)  λ1α n fi(x, y) ≜ Hd U x, V y; λ, α + Γd (x), 2019a], our result reveals an explicit dependence on κ. 2nλ2 n 1 X f¯(x, y) f¯(x, y) 3.3 FINITE-SUM NC-SC PROBLEMS ≜ n i (15) i=1 n The second case we consider is finite-sum NC-SC (FS-NC- 1 X    λ2α  = H U(i)x, V(i)y; λ, α + 1 Γn(x) , SC) minimax problems, for the function class F L,µ,∆, the n d 2nλ d NCSC i=1 2 linear-span algorithm class A and the averaged smooth IFO L,AS class OIFO . The complexity is defined in (6). Pd note that by denoting Γd(x) ≜ i=1 Γ(xi), it is easy to find that Hard instance construction To derive the finite-sum hard instance, we modify Fd in (7) with orthogonal matrices n i(d+1)−1 n n X X X  (i)  defined as follows. Γd (x) = Γ(xj) = Γd U x i=1 j=i(d+1)−d i=1 (16) Definition 3.2 (Orthogonal Matrices) For positive inte- n d   + X X   gers a, b, n ∈ N , we define a matrix sequence = Γ U(i)x . (i) n j {U }i=1 ∈ Orth(a, b, n) if for each i, j ∈ {1, ··· , n} i=1 j=1 and i ̸= j, U(i), U(j) ∈ Ra×b and U(i)(U(i))⊤ = I ∈ a×a (i) (j) ⊤ a×a (i) (i) R and U (U ) = 0 ∈ R . Define u ≜ U x, we summarize the properties of the above functions in the following lemma. Here the intuition for the finite-sum hard instance is com- bining n independent copies of the hard instance in the Lemma 3.3 (Properties of f¯) For the above functions general case (7), then appropriate orthogonal matrices will {f¯} and f¯ in , they satisfy that: convert the n independent variables with dimension d into i i (15) one variable with dimension n × d, which results in the ¯ (i) {fi}i is LF -average smooth where LF = desired hard instance. To preserve the zero chain property, r n C2 λ4α2 4 2 o (i) n (i) n 1 2 2 γ 1 λ1α 2 for {U }i=1 ∈ Orth(d + 1, n(d + 1), n), {V }i=1 ∈ max 16λ1 + 8λ2, 2 + 2 + 8λ1 . 
n nλ2 λ2 Orth(d + 2, n(d + 2), n), ∀n, d ∈ N+ and x ∈ Rn(d+1), y ∈ n(d+2) U(i) V(i) n ¯ 2λ2 R , we set and by concatenating (ii) f is µF -strongly concave on y where µF = n . matrices: (iii) For n ∈ N+, L ≥ 2nµ > 0, if we set λ = λ∗ = (i)   ∗ ∗ p n nµ  nµ U = 0d+1 ··· 0d+1 Id+1 0d+1 ··· 0d+1 , (λ , λ ) = L, , α = ∈ [0, 1], then (13) 1 2 40 2 50L (i)   ¯ ¯ V = 0d+2 ··· 0d+2 Id+2 0d+2 ··· 0d+2 , {fi}i is L-AS and f is µ-strongly concave on y. (iv) With Φ is defined in (9), let Φ(¯ x) max f¯(x, y), 4 FASTER ALGORITHMS FOR NC-SC d ≜ y then we have MINIMAX PROBLEMS n 1 X Φ(¯ x) = Φ¯ (x), Φ¯ (x) Φ (U(i)x). (17) n i i ≜ d In this section, we introduce a generic Catalyst acceleration i=1 scheme that turns existing optimizers for (finite-sum) SC-SC We defer the proof of Lemma 3.3 to Appendix C.2.1. From minimax problems into efficient, near-optimal algorithms Lemma 3.2, we have for (finite-sum) NC-SC minimax optimization. Rooted in n 1 X  the inexact accelerated proximal point algorithm, the idea of Φ(0)¯ − inf Φ(¯ x) ≤ sup Φ(0)¯ − Φ¯ i(x) x∈ n(d+1) n d+1 Catalyst acceleration was introduced in Lin et al. [2015] for R i=1 x∈R √ (18) convex minimization and later extended to nonconvex mini- λ2  α  ≤ 1 + 10αd . mization in Paquette et al. [2018] and nonconvex-concave 2λ 2 2 minimax optimization in Yang et al. [2020b]. In stark con- Define the index set I as all the indices i ∈ [n] such that trast, we focus on NC-SC minimax problems. u(i) = u(i) = 0, ∀i ∈ I. Suppose that |I| > n , by d d+1 2 The backbone of our Catalyst framework is to repeatedly orthogonality and Lemma 3.2 we have solve regularized subproblems of the form: n 2 n 2 ¯ 2 1 X ¯ 1 X  (i) ∇Φ(x) = ∇Φi(x) = ∇Φd u 2 τ 2 n n2 min max f(x, y) + L∥x − x˜t∥ − ∥y − y˜t∥ , i=1 i=1 d d x∈R 1 y∈R 2 2 2 2  2  4 1 X  (i) 1 n λ1 3 λ1 3 ≥ ∇Φ u ≥ α 4 = α 2 . 
Now we arrive at our final theorem for the averaged smooth FS-NC-SC case.

Theorem 3.2 (LB for AS FS-NC-SC). For the linear-span algorithm class A, parameters L, µ, ∆ > 0, and component size n ∈ N₊, if L ≥ 2nµ > 0 and the accuracy ϵ satisfies

    ϵ² ≤ min{ αL²∆/(76800nµ), αL²∆/(1280nµ), L²∆/µ },  where α = nµ/(50L) ∈ [0, 1],

then we have

    Compl(F_NCSC^{L,µ,∆}, A, O_IFO^{L,AS}) = Ω(n + √(nκ) ∆Lϵ⁻²).    (20)

The theorem above indicates that for any algorithm A ∈ A, we can construct a function f(x, y) = (1/n) Σ_{i=1}^n f_i(x, y) such that f ∈ F_NCSC^{L,µ,∆} and {f_i}_i is L-AS, and A requires at least Ω(n + √(nκ) ∆Lϵ⁻²) IFO calls to attain an approximate stationary point of its primal function (in expectation). The hard-instance construction is based on f̄ and f̄_i above (15), combined with a scaling trick similar to the one used in the general case. We also remark that the lower bound holds for small enough ϵ, and the requirement on ϵ is comparable to those in the existing literature, e.g., [Zhou and Gu, 2019, Han et al., 2021]. The detailed statement and proof of the theorem are deferred to Appendix C.2.2.

Remark 3.2 (Tightness). The state-of-the-art upper bound for NC-SC finite-sum problems is Õ(n + √n κ²∆Lϵ⁻²) when n ≥ κ² and O(nκ + κ²∆Lϵ⁻²) when n ≤ κ² [Luo et al., 2020]. Note that there is still a large gap between the upper and lower bounds in their dependence on κ and n, which motivates the design of faster algorithms for the FS-NC-SC case; we address this in Section 4. A weaker lower bound for nonconvex finite-sum averaged smooth minimization is Ω(√n ∆Lϵ⁻²) [Fang et al., 2018, Zhou and Gu, 2019]; our result makes the dependence on κ explicit.

In these subproblems, x̃_t and ỹ_t are carefully chosen prox-centers, and the parameter τ ≥ 0 is selected so that the condition numbers of the x-component and y-component of the subproblems are well-balanced. Since f is L-Lipschitz smooth and µ-strongly concave in y, each such auxiliary problem is L-strongly convex in x and (µ + τ)-strongly concave in y. Therefore, it can be solved by a wide family of off-the-shelf first-order algorithms with linear convergence rates.

Our Catalyst framework, presented in Algorithm 1, consists of three crucial components: an inexact proximal point step for the primal update, an inexact accelerated proximal point step for the dual update, and a linearly-convergent algorithm for solving the subproblems.

Inexact proximal point step in the primal. The x-updates in the outer loop, {x_0^t}_{t=1}^T, can be viewed as applying an inexact proximal point method to the primal function Φ(x), which requires solving a sequence of auxiliary problems:

    min_{x∈R^{d1}} max_{y∈R^{d2}} [ f̂_t(x, y) ≜ f(x, y) + L‖x − x_0^t‖² ].    (⋆)

Inexact proximal point methods have been explored in minimax optimization in several works, e.g., [Lin et al., 2020b, Rafique et al., 2018]. Our scheme is distinct from these works in two aspects: (i) we introduce a new subroutine to approximately solve the auxiliary problems (⋆) with near-optimal complexity, and (ii) the inexactness is measured by an adaptive stopping criterion based on gradient norms:

    ‖∇f̂_t(x_0^{t+1}, y_0^{t+1})‖² ≤ α_t ‖∇f̂_t(x_0^t, y_0^t)‖²,    (21)

where {α_t}_t is carefully chosen. The adaptive stopping criterion significantly reduces the complexity of solving the auxiliary problems: we will show that the number of steps required is only logarithmic in L and µ, with no dependence on the target accuracy ϵ. Although the auxiliary problem is (L, µ)-SC-SC and can be solved with linear convergence by algorithms such as extragradient and OGDA, these algorithms are not optimal in terms of the dependence on the condition number when L > µ [Zhang et al., 2019].

Algorithm 1 Catalyst for NC-SC Minimax Problems

Input: objective f, initial point (x₀, y₀), smoothness constant L, strong-concavity constant µ, and parameter τ > 0.
1: Let (x_0^0, y_0^0) = (x₀, y₀) and q = µ/(µ + τ).
2: for t = 0, 1, ..., T do
3:   Let z_1 = y_0^t and k = 1.
4:   Let f̂_t(x, y) ≜ f(x, y) + L‖x − x_0^t‖².
5:   repeat
6:     Find an inexact solution (x_k^t, y_k^t) to the problem below by algorithm M with initial point (x_{k−1}^t, y_{k−1}^t):
           min_{x∈R^{d1}} max_{y∈R^{d2}} [ f̃_{t,k}(x, y) ≜ f(x, y) + L‖x − x_0^t‖² − (τ/2)‖y − z_k‖² ]    (⋆⋆)
       such that ‖∇f̃_{t,k}(x_k^t, y_k^t)‖² ≤ ϵ_k^t.
7:     Let z_{k+1} = y_k^t + ((√q − q)/(√q + q)) (y_k^t − y_{k−1}^t) and k = k + 1.
8:   until ‖∇f̂_t(x_k^t, y_k^t)‖² ≤ α_t ‖∇f̂_t(x_0^t, y_0^t)‖²
9:   Set (x_0^{t+1}, y_0^{t+1}) = (x_k^t, y_k^t).
10: end for
Output: x̂_T, which is sampled uniformly from x_0^1, ..., x_0^T.

The linear rate of the subproblem solver M holds deterministically if M is a deterministic algorithm, or in expectation on the left-hand side of the linear-convergence bound (22) if M is randomized. The choices for M include, but are not limited to, extragradient (EG) [Tseng, 1995], optimistic gradient descent ascent (OGDA) [Gidel et al., 2018], SVRG [Balamurugan and Bach, 2016], SPD1-VR [Tan et al., 2018], SVRE [Chavdarova et al., 2019], Point-SAGA [Luo et al., 2019], and the variance-reduced prox-method [Carmon et al., 2019c]. For example, in the case of EG, Λ^M_{µ,L}(τ) = 4(L + max{2L, τ})/min{L, µ + τ} [Tseng, 1995].

4.1 CONVERGENCE ANALYSIS

In this section, we analyze the complexity of each of the three components discussed above. Let T denote the outer-loop complexity, K the inner-loop complexity, and N the number of iterations for M (the expected number if M is randomized) to solve subproblem (⋆⋆). The total complexity of Algorithm 1 is the product of K, T, and N. Later, we provide a guideline for choosing the parameter τ to achieve the best complexity for a given algorithm M.

Theorem 4.1 (Outer loop). Suppose the function f is NC-SC with strong-concavity parameter µ and is L-Lipschitz smooth. If we choose α_t = µ⁵/(504L⁵) for t > 0 and α₀ = µ⁵/(576 max{1, L⁷}), the output x̂_T of Algorithm 1 satisfies

    E‖∇Φ(x̂_T)‖² ≤ (268L/(5T)) ∆ + (28L/(5T)) D_y^0,    (23)

where ∆ = Φ(x₀) − inf_x Φ(x), D_y^0 = ‖y₀ − y*(x₀)‖², and y*(x₀) = argmax_{y∈R^{d2}} f(x₀, y).
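As a sanity check of the overall loop structure, the sketch below runs the loop of Algorithm 1 on a toy one-dimensional NC-SC objective, with plain extragradient as the solver M. It is a minimal illustration only: the objective f(x, y) = cos x + xy − y², the constants L, µ, τ, the step size, and the loop lengths are placeholder choices for this example, not the tuned values of α_t and ϵ_k^t from the theorems.

```python
import numpy as np

# Toy NC-SC objective: f(x, y) = cos(x) + x*y - y**2, nonconvex in x and
# 2-strongly concave in y; its primal function is Phi(x) = cos(x) + x**2 / 4.
def grad_f(x, y):
    return -np.sin(x) + y, x - 2.0 * y   # (d/dx f, d/dy f)

L, mu, tau = 2.0, 2.0, 0.5               # placeholder constants for this toy problem
q = mu / (mu + tau)

def extragradient(grad, x, y, eta=0.05, iters=200):
    """Solver M: plain extragradient on a strongly-convex-strongly-concave problem."""
    for _ in range(iters):
        gx, gy = grad(x, y)
        xh, yh = x - eta * gx, y + eta * gy   # extrapolation point
        gx, gy = grad(xh, yh)
        x, y = x - eta * gx, y + eta * gy     # update with the extrapolated gradient
    return x, y

def catalyst(x0, y0, T=60, K=5, alpha=0.25):
    x, y = x0, y0
    for _ in range(T):
        xc = x                                # primal prox-center x_0^t

        def g_hat(x_, y_):                    # gradient of f_hat_t = f + L*(x - xc)^2
            gx, gy = grad_f(x_, y_)
            return gx + 2.0 * L * (x_ - xc), gy

        g0 = np.hypot(*g_hat(x, y))           # gradient norm at the warm start
        z, y_prev = y, y
        for _ in range(K):
            zc = z

            def g_tilde(x_, y_):              # gradient of f_tilde = f_hat - tau/2*(y - zc)^2
                gx, gy = g_hat(x_, y_)
                return gx, gy - tau * (y_ - zc)

            x, y_new = extragradient(g_tilde, x, y_prev)
            # Nesterov-style extrapolation of the dual prox-center (line 7)
            z = y_new + (np.sqrt(q) - q) / (np.sqrt(q) + q) * (y_new - y_prev)
            y_prev = y_new
            if np.hypot(*g_hat(x, y_new)) ** 2 <= alpha * g0 ** 2:
                break                         # adaptive stopping criterion (21)
        y = y_prev
    return x, y
```

Running `catalyst(0.5, 0.0)` drives x toward a stationary point of Φ (here, a solution of sin x = x/2) and y toward the best response y*(x) = x/2.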

This theorem implies that the algorithm finds an ϵ-stationary point of Φ after inexactly solving (⋆) T = O(L(∆ + D_y^0)ϵ⁻²) times. The dependence on D_y^0 can be eliminated by selecting the initialization y₀ close enough to y*(x₀), which requires only an additional logarithmic cost of maximizing a strongly concave function.

Inexact accelerated proximal point step in the dual. To solve the auxiliary problem with optimal complexity, we introduce an inexact accelerated proximal point scheme. The key idea is to add an extra regularization in y to the objective so that the strong convexity and strong concavity are well-balanced. We therefore propose to iteratively solve the subproblems

    min_{x∈R^{d1}} max_{y∈R^{d2}} [ f̃_{t,k}(x, y) ≜ f̂_t(x, y) − (τ/2)‖y − z_k‖² ],    (⋆⋆)

where {z_k}_k is updated analogously to Nesterov's accelerated method [Nesterov, 2005] and τ ≥ 0 is the regularization parameter. For example, by setting τ = L − µ, the subproblems become (L, L)-SC-SC and can be approximately solved by the extragradient method with optimal complexity, as discussed in more detail in the next section. Finally, when solving these subproblems, we use the stopping criterion ‖∇f̃_{t,k}(x, y)‖² ≤ ϵ_k^t with a time-varying accuracy ϵ_k^t that decays exponentially in k.

Theorem 4.2 (Inner loop). Under the same assumptions as in Theorem 4.1, if we choose ϵ_k^t = 2µ(1 − ρ)^k gap_{f̂_t}(x_0^t, y_0^t) with ρ < √q, where q = µ/(µ + τ), then

    ‖∇f̂_t(x_k^t, y_k^t)‖² ≤ ( 5508L²/(µ²(√q − ρ)²) + 18√2 L/µ ) (1 − ρ)^k ‖∇f̂_t(x_0^t, y_0^t)‖².

In particular, setting ρ = 0.9√q, Theorem 4.2 implies that after inexactly solving (⋆⋆) K = Õ(√((τ + µ)/µ) log(1/α_t)) times, the stopping criterion (21) is satisfied.
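The dual update above is an accelerated proximal point scheme in y. The following toy sketch (our illustration, with arbitrary constants µ, τ, b; it is not the full algorithm, which also updates x and solves each prox subproblem only inexactly) applies an exact prox step together with the momentum rule of line 7 of Algorithm 1 to a one-dimensional strongly concave quadratic:

```python
import numpy as np

# Accelerated proximal point in the dual on h(y) = -mu/2 * y**2 + b * y:
# each step maximizes h(y) - tau/2 * (y - z_k)**2 (closed form for a quadratic),
# and z is extrapolated with the (sqrt(q) - q)/(sqrt(q) + q) momentum of Algorithm 1.
mu, tau, b = 1.0, 9.0, 3.0     # arbitrary illustrative constants
q = mu / (mu + tau)
y_star = b / mu                # exact maximizer of h

y_prev, z = 0.0, 0.0
for k in range(50):
    y = (b + tau * z) / (mu + tau)            # exact prox step on h at center z
    z = y + (np.sqrt(q) - q) / (np.sqrt(q) + q) * (y - y_prev)
    y_prev = y

print(abs(y - y_star))  # error decays at roughly the accelerated rate (1 - sqrt(q))^k
```

Without the extrapolation of z, the same recursion contracts only at the unaccelerated rate 1 − q per prox step, which is the gap the momentum term is designed to close.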
The inner-loop complexity K decreases with τ. However, we should not choose τ too small: the smaller τ is, the harder it is for M to solve (⋆⋆).

Linearly-convergent algorithms for SC-SC subproblems. Let M be any algorithm that solves the subproblem (⋆⋆) (denoting by (x*, y*) its optimal solution) at a linear convergence rate, so that after N iterations

    ‖x_N − x*‖² + ‖y_N − y*‖² ≤ (1 − 1/Λ^M_{µ,L}(τ))^N [‖x₀ − x*‖² + ‖y₀ − y*‖²].    (22)

The following theorem captures the complexity for algorithm M to solve the subproblem.

Theorem 4.3 (Complexity of solving subproblems (⋆⋆)). Under the same assumptions as in Theorem 4.1 and the choice of ϵ_k^t in Theorem 4.2, the number of iterations (expected number of iterations if M is stochastic) for M to solve (⋆⋆) such that ‖∇f̃_{t,k}(x, y)‖² ≤ ϵ_k^t is

    N = O( Λ^M_{µ,L}(τ) log( max{1, L, τ} / min{1, µ} ) ).

The above result implies that the subproblems can be solved within a constant number of iterations that depends only on L, µ, τ, and Λ^M_{µ,L}. This largely benefits from the use of warm starts and a stopping criterion with time-varying accuracy. In contrast, other inexact proximal point algorithms in minimax optimization, such as [Yang et al., 2020b, Lin et al., 2020b], fix the target accuracy, so their complexity of solving the subproblems usually carries an extra logarithmic factor in 1/ϵ.
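To make the trade-off in τ concrete for extragradient: plugging the rate constant Λ^M_{µ,L}(τ) = 4(L + max{2L, τ})/min{L, µ + τ} quoted earlier into the product Λ^M_{µ,L}(τ)√(µ + τ) that scales the total complexity, and minimizing numerically, recovers the choice τ = L − µ. A quick check (our sketch, with arbitrary L and µ):

```python
import numpy as np

# Rate constant of extragradient on subproblem (**), as quoted in the text:
# Lambda(tau) = 4 * (L + max{2L, tau}) / min{L, mu + tau}.
def lambda_eg(tau, L, mu):
    return 4.0 * (L + max(2.0 * L, tau)) / min(L, mu + tau)

# Proxy for the total complexity: Lambda(tau) * sqrt(mu + tau).
def total_cost(tau, L, mu):
    return lambda_eg(tau, L, mu) * np.sqrt(mu + tau)

L, mu = 100.0, 1.0                       # arbitrary illustrative constants
taus = np.linspace(0.0, 5.0 * L, 20001)  # grid with step 0.025
best_tau = taus[np.argmin([total_cost(t, L, mu) for t in taus])]
print(best_tau)  # close to L - mu = 99
```

For τ ≤ L − µ the proxy behaves like 12L/√(µ + τ) (decreasing), while for L − µ ≤ τ ≤ 2L it behaves like 12√(µ + τ) (increasing), so the minimizer sits exactly at τ = L − µ.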
The overall complexity of the algorithm follows immediately by combining the three theorems above:

Corollary 4.1. Under the same assumptions as in Theorem 4.1 and the setting of Theorem 4.2, the total number (expected number if M is randomized) of gradient evaluations for Algorithm 1 to find an ϵ-stationary point of Φ is

    Õ( Λ^M_{µ,L}(τ) · (L(∆ + D_y^0)/ϵ²) · √((µ + τ)/µ) ).    (24)

To minimize the total complexity, we should therefore choose the regularization parameter τ that minimizes Λ^M_{µ,L}(τ)√(µ + τ).

4.2 SPECIFIC ALGORITHMS AND COMPLEXITIES

In this subsection, we discuss specific choices of M and the corresponding optimal choices of τ, as well as the resulting total complexities for solving NC-SC problems.

Catalyst-EG/OGDA algorithm. When solving NC-SC minimax problems in the general setting, we set M to be either the extragradient method (EG) or optimistic gradient descent ascent (OGDA). Hence, we have Λ^M_{µ,L}(τ) = 4(L + max{2L, τ})/min{L, µ + τ} [Gidel et al., 2018, Azizian et al., 2020]. Minimizing Λ^M_{µ,L}(τ)√(µ + τ) yields the optimal choice τ = L − µ. This leads to a total complexity of

    Õ( √κ L(∆ + D_y^0) ϵ⁻² ).    (25)

Remark 4.1. The above complexity matches the lower bound in Theorem 3.1 up to a logarithmic factor in L and κ. It improves over Minimax-PPA [Lin et al., 2020b] by log²(1/ϵ) and over GDA [Lin et al., 2020a] by κ^{3/2}, and therefore achieves the best of both worlds in terms of the dependence on κ and ϵ. In addition, our Catalyst-EG/OGDA algorithm does not require a bounded-domain assumption on y, unlike [Lin et al., 2020b].

Catalyst-SVRG/SAGA algorithm. When solving NC-SC minimax problems in the averaged smooth finite-sum setting, we set M to be either SVRG or SAGA. Hence, we have Λ^M_{µ,L}(τ) ∝ n + ((L + √2 max{2L, τ})/min{L, µ + τ})² [Balamurugan and Bach, 2016].²·³ Minimizing Λ^M_{µ,L}(τ)√(µ + τ), the best choice for τ is (proportional to) max{L/√n − µ, 0}, which leads to a total complexity of

    Õ( (n + n^{3/4}√κ) L(∆ + D_y^0) ϵ⁻² ).    (26)

²Although Balamurugan and Bach [2016] assumes individual smoothness, its analysis can be extended to average smoothness.

³SVRG in [Balamurugan and Bach, 2016] requires computing the proximal operator of a (µ, µ)-SC-SC function. Any (µ, µ)-SC-SC function of the form Σ_i f_i(x, y) can be rewritten as Σ_i [f_i(x, y) − (µ/(2n))‖x‖² + (µ/(2n))‖y‖²] + (µ/2)(‖x‖² − ‖y‖²), where the first term is convex-concave and the second term is (µ, µ)-SC-SC and admits a simple proximal operator.

Remark 4.2. According to the lower bound established in Theorem 3.2, the dependence on κ in the above upper bound is nearly tight, up to logarithmic factors. Recall that SREDA [Luo et al., 2020] achieves the complexity Õ( κ²√n ϵ⁻² + n + (n + κ) log κ ) for n ≥ κ² and O( (κ² + κn) ϵ⁻² ) for n ≤ κ². Hence, our Catalyst-SVRG/SAGA algorithm attains better complexity in the regime n ≤ κ⁴. In particular, in the critical regime κ = Ω(√n) arising in statistical learning [Shalev-Shwartz and Ben-David, 2014], our algorithm performs strictly better.

5 CONCLUSION

In this work, we take an initial step toward understanding the fundamental limits of minimax optimization in the nonconvex-strongly-concave setting, in both the general and finite-sum cases, and we bridge the gaps between lower and upper bounds. It remains interesting to investigate whether the dependence on n in the complexity of finite-sum NC-SC minimax optimization can be further tightened.

Acknowledgements

This work is supported by NSF CMMI #1761699 and NSF CCF #1934986. CG's research is partially supported by INRIA through the INRIA Associate Teams project and by the FONDECYT 1210362 project. NK's research is supported by NSF CCF #1704970.

References

Alekh Agarwal and Leon Bottou. A lower bound for the optimization of finite sums. In International Conference on Machine Learning, pages 78–86. PMLR, 2015.

Zeyuan Allen-Zhu. How to make the gradients small stochastically: even faster convex and nonconvex SGD. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 1165–1175, 2018.

Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.

Waïss Azizian, Ioannis Mitliagkas, Simon Lacoste-Julien, and Gauthier Gidel. A tight and unified analysis of gradient-based methods for a whole spectrum of differentiable games. In International Conference on Artificial Intelligence and Statistics, pages 2863–2873, 2020.

P Balamurugan and Francis Bach. Stochastic variance reduction methods for saddle-point problems. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 1416–1424, 2016.

Radu Ioan Boţ and Axel Böhm. Alternating proximal-gradient steps for (stochastic) nonconvex-concave minimax problems. arXiv preprint arXiv:2007.13605, 2020.

Gábor Braun, Cristóbal Guzmán, and Sebastian Pokutta. Lower bounds on the oracle complexity of nonsmooth convex optimization via information theory. IEEE Transactions on Information Theory, 63(7):4709–4724, 2017.

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. Mathematical Programming, pages 1–50, 2019a.

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points II: first-order methods. Mathematical Programming, pages 1–41, 2019b.

Yair Carmon, Yujia Jin, Aaron Sidford, and Kevin Tian. Variance reduction for matrix games. In Advances in Neural Information Processing Systems, pages 11381–11392, 2019c.

Tatjana Chavdarova, Gauthier Gidel, François Fleuret, and Simon Lacoste-Julien. Reducing noise in GAN training with variance reduced extragradient. In Advances in Neural Information Processing Systems, pages 393–403, 2019.

Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. In Artificial Intelligence and Statistics, pages 1458–1467, 2017.

Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, 2018.

Marina Danilova, Pavel Dvurechensky, Alexander Gasnikov, Eduard Gorbunov, Sergey Guminov, Dmitry Kamzolov, and Innokentiy Shibaev. Recent theoretical advances in non-convex optimization. arXiv preprint arXiv:2012.06188, 2020.

Constantinos Daskalakis, Stratis Skoulakis, and Manolis Zampetakis. The complexity of constrained min-max optimization. arXiv preprint arXiv:2009.09623, 2020.

Jelena Diakonikolas. Halpern iteration for near-optimal and parameter-free monotone inclusion and strong solutions to variational inequalities. In Conference on Learning Theory, pages 1428–1451. PMLR, 2020.

Jelena Diakonikolas and Cristóbal Guzmán. Complementary composite minimization, small gradients in general norms, and applications to regression problems. arXiv preprint arXiv:2101.11041, 2021.

Jelena Diakonikolas, Constantinos Daskalakis, and Michael I Jordan. Efficient methods for structured nonconvex-nonconcave min-max optimization. arXiv preprint arXiv:2011.00364, 2020.

Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178(1):503–558, 2019.

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689–699, 2018.

Roy Frostig, Rong Ge, Sham Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning. PMLR, 2015.

Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial networks. In International Conference on Learning Representations, 2018.

Osman Güler. On the convergence of the proximal point algorithm for convex minimization. SIAM Journal on Control and Optimization, 29(2):403–419, 1991.

Osman Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.

Zhishuai Guo, Mingrui Liu, Zhuoning Yuan, Li Shen, Wei Liu, and Tianbao Yang. Communication-efficient distributed stochastic AUC maximization with deep neural networks. In International Conference on Machine Learning, pages 3864–3874. PMLR, 2020.

Yuze Han, Guangzeng Xie, and Zhihua Zhang. Lower complexity bounds of finite-sum optimization problems: The results and construction. arXiv preprint arXiv:2103.08280, 2021.

Jiawei Huang and Nan Jiang. On the convergence rate of density-ratio based off-policy policy gradient methods. In Neural Information Processing Systems Offline Reinforcement Learning Workshop, 2020.

Adam Ibrahim, Waïss Azizian, Gauthier Gidel, and Ioannis Mitliagkas. Linear lower bounds and conditioning of differentiable games. In International Conference on Machine Learning, pages 4583–4593. PMLR, 2020.

Chi Jin, Praneeth Netrapalli, and Michael Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, 2020.

Qi Lei, Jason Lee, Alex Dimakis, and Constantinos Daskalakis. SGD learns one-layer networks in WGANs. In International Conference on Machine Learning, pages 5799–5808. PMLR, 2020.

Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015.

Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research, 18(1):7854–7907, 2018a.

Qihang Lin, Mingrui Liu, Hassan Rafique, and Tianbao Yang. Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality. arXiv preprint arXiv:1810.10207, 2018b.

Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, 2020a.

Tianyi Lin, Chi Jin, and Michael I Jordan. Near-optimal algorithms for minimax optimization. In Conference on Learning Theory, pages 2738–2779. PMLR, 2020b.

Luo Luo, Cheng Chen, Yujun Li, Guangzeng Xie, and Zhihua Zhang. A stochastic proximal point algorithm for saddle-point problems. arXiv preprint arXiv:1909.06946, 2019.

Luo Luo, Haishan Ye, Zhichao Huang, and Tong Zhang. Stochastic recursive gradient descent ascent for stochastic nonconvex-strongly-concave minimax problems. Advances in Neural Information Processing Systems, 33, 2020.

Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In International Conference on Learning Representations, 2019.

Arkadi S. Nemirovski and David B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

Arkadi S. Nemirovsky. On optimality of Krylov's information when solving linear operator equations. Journal of Complexity, 7(2):121–130, 1991.

Arkadi S. Nemirovsky. Information-based complexity of linear operator equations. Journal of Complexity, 8(2):153–175, 1992.

Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

Yurii Nesterov. How to make the gradients small. Optima. Mathematical Optimization Society Newsletter, pages 10–11, 2012.

Yurii Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.

Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, 2019.

Dmitrii M Ostrovskii, Andrew Lowy, and Meisam Razaviyayn. Efficient search of first-order Nash equilibria in nonconvex-concave smooth min-max problems. arXiv preprint arXiv:2002.07919, 2020.

Yuyuan Ouyang and Yangyang Xu. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, pages 1–35, 2019.

Courtney Paquette, Hongzhou Lin, Dmitriy Drusvyatskiy, Julien Mairal, and Zaid Harchaoui. Catalyst for gradient-based nonconvex optimization. In International Conference on Artificial Intelligence and Statistics, pages 613–622. PMLR, 2018.

Qi Qian, Shenghuo Zhu, Jiasheng Tang, Rong Jin, Baigui Sun, and Hao Li. Robust optimization over multiple domains. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4739–4746, 2019.

Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060, 2018.

R Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 1976.

Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, and Jason D Lee. On the convergence and robustness of training GANs with regularized optimal transport. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 7091–7101, 2018.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

Conghui Tan, Tong Zhang, Shiqian Ma, and Ji Liu. Stochastic primal-dual method for empirical risk minimization with O(1) per-iteration complexity. In Advances in Neural Information Processing Systems, pages 8366–8375, 2018.

Kiran K Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Efficient algorithms for smooth minimax optimization. Advances in Neural Information Processing Systems, 32:12680–12691, 2019.

Joseph F. Traub, G. W. Wasilkowski, and H. Woźniakowski. Information-Based Complexity. Academic Press, 1988. ISBN 0-12-697545-0.

Paul Tseng. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 60(1-2):237–252, 1995.

Guangzeng Xie, Luo Luo, Yijiang Lian, and Zhihua Zhang. Lower complexity bounds for finite-sum convex-concave minimax optimization problems. In International Conference on Machine Learning, 2020.

Zi Xu, Huiling Zhang, Yang Xu, and Guanghui Lan. A unified single-loop alternating gradient projection algorithm for nonconvex-concave and convex-nonconcave minimax problems. arXiv preprint arXiv:2006.02032, 2020.

Junchi Yang, Negar Kiyavash, and Niao He. Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 33, 2020a.

Junchi Yang, Siqi Zhang, Negar Kiyavash, and Niao He. A catalyst framework for minimax optimization. In Advances in Neural Information Processing Systems, 2020b.

Yingxiang Yang, Negar Kiyavash, Le Song, and Niao He. The devil is in the detail: A framework for macroscopic prediction via microscopic models. Advances in Neural Information Processing Systems, 33, 2020c.

Andrew Chi-Chih Yao. Probabilistic computations: Toward a unified measure of complexity. In 18th Annual Symposium on Foundations of Computer Science (SFCS 1977), pages 222–227. IEEE Computer Society, 1977.

TaeHo Yoon and Ernest K Ryu. Accelerated algorithms for smooth convex-concave minimax problems with O(1/k²) rate on squared gradient norm. arXiv preprint arXiv:2102.07922, 2021.

Jiawei Zhang, Peijun Xiao, Ruoyu Sun, and Zhi-Quan Luo. A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems. arXiv preprint arXiv:2010.15768, 2020.

Junyu Zhang, Mingyi Hong, and Shuzhong Zhang. On lower iteration complexity bounds for the saddle point problems. arXiv preprint arXiv:1912.07481, 2019.

Renbo Zhao. A primal dual smoothing framework for max-structured nonconvex optimization. arXiv preprint arXiv:2003.04375, 2020.

Dongruo Zhou and Quanquan Gu. Lower bounds for smooth nonconvex finite-sum optimization. In International Conference on Machine Learning, pages 7574–7583. PMLR, 2019.