Kernel Distributionally Robust Optimization: Generalized Duality Theorem and Stochastic Approximation

Jia-Jie Zhu
Empirical Inference Department
Max Planck Institute for Intelligent Systems
Tübingen, Germany
[email protected]

Wittawat Jitkrittum
Empirical Inference Department
Max Planck Institute for Intelligent Systems
Tübingen, Germany
Currently at Google Research, NYC, USA
[email protected]

Moritz Diehl
Department of Microsystems Engineering & Department of Mathematics
University of Freiburg
Freiburg, Germany
[email protected]

Bernhard Schölkopf
Empirical Inference Department
Max Planck Institute for Intelligent Systems
Tübingen, Germany
[email protected]

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).

Abstract

We propose kernel distributionally robust optimization (Kernel DRO) using insights from the robust optimization theory and functional analysis. Our method uses reproducing kernel Hilbert spaces (RKHS) to construct a wide range of convex ambiguity sets, which can be generalized to sets based on integral probability metrics and finite-order moment bounds. This perspective unifies multiple existing robust and stochastic optimization methods. We prove a theorem that generalizes the classical duality in the mathematical problem of moments. Enabled by this theorem, we reformulate the maximization with respect to measures in DRO into the dual program that searches for RKHS functions. Using universal RKHSs, the theorem applies to a broad class of loss functions, lifting common limitations such as polynomial losses and knowledge of the Lipschitz constant. We then establish a connection between DRO and stochastic optimization with expectation constraints. Finally, we propose practical algorithms based on both batch convex solvers and stochastic functional gradient, which apply to general optimization and machine learning tasks.

1 INTRODUCTION

Imagine a hypothetical scenario in the illustrative figure where we want to arrive at a destination while avoiding unknown obstacles. A worst-case robust optimization (RO) (Ben-Tal et al., 2009) approach is then to avoid the entire unsafe area (left, blue). Suppose we have historical locations of the obstacles (right, dots). We may choose to avoid only the convex polytope that contains all the samples (pink). This data-driven robust decision-making idea improves efficiency while retaining robustness.

The concept of distributional ambiguity concerns the uncertainty of uncertainty — the underlying probability measure is only partially known or subject to change. This idea is by no means a new one. The classical moment problem concerns itself with estimating the worst-case risk expressed by max_{P∈K} ∫ l dP, where l is some loss function. The constraint P ∈ K describes the distribution ambiguity, i.e., P is only known to live within a subset K of probability measures. The solution to the moment problem gives the risk under some worst-case distribution within K. To make decisions that minimize this worst-case risk is the idea of distributionally robust optimization (DRO) (Delage and Ye, 2010; Scarf, 1958).

Many of today's learning tasks suffer from various manifestations of distributional ambiguity — e.g., covariate shift, adversarial attacks, simulation-to-reality transfer — phenomena that are caused by the discrepancy between training and test distributions. Kernel methods are known to possess robustness properties, e.g., (Christmann and Steinwart, 2007; Xu et al., 2009). However, this robustness only applies to kernelized models. This paper extends the robustness of kernel methods using the robust counterpart formulation techniques (Ben-Tal et al., 2009) as well as the principled conic duality theory (Shapiro, 2001). We term our approach kernel distributionally robust optimization (Kernel DRO), which can robustify general optimization solutions not limited to kernelized models.

The main contributions of this paper are:

1. We rigorously prove the generalized duality theorem (Theorem 3.1) that reformulates general DRO into a convex dual problem searching for RKHS functions, lifting common limitations of DRO on the loss functions, such as the knowledge of the Lipschitz constant. The theorem also constitutes a generalization of the duality results from the literature of the mathematical problem of moments.
2. We use RKHSs to construct a wide range of convex ambiguity sets (in Table 1, 3), including sets based on integral probability metrics (IPM) and finite-order moment bounds. This perspective unifies existing RO and DRO methods.
3. We propose computational algorithms based on both convex solvers and stochastic approximation, which can be applied to robustify general optimization and machine learning models not limited to kernelized or known-Lipschitz-constant ones.
4. Finally, we establish an explicit connection between DRO and stochastic optimization with expectation constraints. This leads to a novel stochastic functional gradient DRO (SFG-DRO) algorithm which can scale up to modern machine learning tasks.

In addition, we give complete self-contained proofs in the appendix that shed light on the connection between RKHSs, conic duality, and DRO. We also show that universal RKHSs are large enough for DRO from the perspective of functional analysis through concrete examples.

2 BACKGROUND

Notation. X ⊂ R^d denotes the input domain, which is assumed to be compact unless otherwise specified. P := P(X) denotes the set of all Borel probability measures on X. We use P̂ to denote the empirical distribution P̂ = (1/N) Σ_{i=1}^N δ_{ξ_i}, where δ is a Dirac measure and {ξ_i}_{i=1}^N are data samples. We refer to the function δ_C(x) := 0 if x ∈ C, ∞ if x ∉ C, as the indicator function. δ*_C(f) := sup_{µ∈C} ⟨f, µ⟩_H is the support function of C. S_N denotes the N-dimensional simplex. ri(·) denotes the relative interior of a set. A function f is upper semicontinuous on X if lim sup_{x→x₀} f(x) ≤ f(x₀) for all x₀ ∈ X; it is proper if it is not identically −∞. When there is no ambiguity, we simplify the loss function notation l(θ, ·) by using l to indicate that results hold for θ point-wise.

2.1 Robust and distributionally robust optimization

Robust optimization (RO) (Ben-Tal et al., 2009) studies mathematical decision-making under uncertainty. It solves the min-max problem (omitting constraints) min_θ sup_{ξ∈X} l(θ, ξ), where l(θ, ξ) denotes a general loss function, θ is the decision variable, and ξ is a variable representing the uncertainty. Intuitively, RO makes the decision assuming an adversarial scenario, as reflected in taking the supremum w.r.t. ξ. For this reason, it is often referred to as the worst-case RO. Recently, RO has been applied to the setting of adversarially robust learning, e.g., in (Madry et al., 2019; Wong and Kolter, 2018), which we will visit in this paper. In the optimization literature, a typical approach to solving RO is to reformulate the min-max program using duality to obtain a single minimization problem. In contrast, distributionally robust optimization (DRO) minimizes the expected loss assuming the worst-case distribution:

$$\min_\theta\; \sup_{P \in \mathcal{K}} \int l(\theta, \xi)\, dP(\xi), \tag{1}$$

where K ⊆ P, called the ambiguity set, is a subset of distributions, e.g., all distributions with the given mean and variance. Compared with RO, DRO only robustifies the solution against a subset K of distributions on X and is, therefore, less conservative (since sup_{P∈K} ∫ l(θ, ξ) dP(ξ) ≤ sup_{ξ∈X} l(θ, ξ)).

The inner problem of (1), historically known as the problem of moments and traced back at least to Thomas Joannes Stieltjes, estimates the worst-case risk under uncertainty in distributions. The modern approaches, pioneered by the work of (Isii, 1962) (see also (Lasserre, 2002; Shapiro, 2001; Bertsimas and Popescu, 2005; Popescu, 2005; Vandenberghe et al., 2007; Van Parys et al., 2016; Zhu et al., 2020)), typically seek a sharp upper bound via duality. This duality, rigorously justified in (Shapiro, 2001), is different from that in the Euclidean space because infinite-dimensional convex sets can become pathological.

Using that methodology, we can reformulate DRO (1) into a single solvable minimization problem.

Existing DRO approaches can be grouped into three main categories by the type of ambiguity sets used. DRO with (finite-order) moment constraints has been studied in (Delage and Ye, 2010; Scarf, 1958; Zymler et al., 2013). The authors of (Ben-Tal et al., 2013; Iyengar, 2005; Nilim and El Ghaoui, 2005; Wang et al., 2016; Duchi et al., 2016) studied DRO using likelihood bounds as well as φ-divergences. Wasserstein-distance-based DRO has been studied by the authors of (Mohajerin Esfahani and Kuhn, 2018; Zhao and Guan, 2018; Gao and Kleywegt, 2016; Blanchet et al., 2019), and applied in a large body of literature. Many existing approaches require either assumptions such as quadratic loss functions or the knowledge of the Lipschitz constant or RKHS norm of the loss l, which are often hard to obtain in practice; see (Virmaux and Scaman, 2018; Bietti et al., 2019).

2.2 Reproducing kernel Hilbert spaces

A symmetric function k: X × X → R is called a positive definite kernel if Σ_{i=1}^n Σ_{j=1}^n a_i a_j k(x_i, x_j) ≥ 0 for any n ∈ N, {x_i}_{i=1}^n ⊂ X, and {a_i}_{i=1}^n ⊂ R. Given a positive definite kernel k, there exists a Hilbert space H and a feature map φ: X → H, for which k(x, y) = ⟨φ(x), φ(y)⟩_H defines an inner product on H, where H is a space of real-valued functions on X. The space H is called a reproducing kernel Hilbert space (RKHS). It is equipped with the reproducing property: f(x) = ⟨f, φ(x)⟩_H for any f ∈ H, x ∈ X. By convention, we will denote the canonical feature map as φ(x) := k(x, ·). Properties of the functions in H are inherited from the properties of k. For instance, if k is continuous, then any f ∈ H is continuous. A continuous kernel k on a compact metric space X is said to be universal if H is dense in C(X) (Steinwart and Christmann, 2008, Section 4.5). A universal H can thus be considered a large RKHS since any continuous function can be approximated arbitrarily well by a function in H. An example of a universal kernel is the Gaussian kernel k(x, y) = exp(−‖x − y‖²₂ / (2σ²)) defined on X, where σ > 0 is the bandwidth parameter.

RKHSs first gained widespread attention following the advent of the kernelized support vector machine (SVM) for classification problems (Cortes and Vapnik, 1995; Boser et al., 1992; Schölkopf et al., 2000). More recently, the use of RKHSs has been extended to manipulating and comparing probability distributions via kernel mean embedding (Smola et al., 2007). Given a distribution P and a (positive definite) kernel k, the kernel mean embedding of P is defined as µ_P := ∫ k(x, ·) dP. If E_{x∼P}[k(x, x)] < ∞, then µ_P ∈ H (Smola et al., 2007, Section 1.2). The reproducing property allows one to easily compute the expectation of any function f ∈ H since E_{x∼P}[f(x)] = ⟨f, µ_P⟩_H. Embedding distributions into H also allows one to measure the distance between distributions in H. If k is universal, then the mean map P ↦ µ_P is injective on P (Gretton et al., 2012). With a universal H, given two distributions P, Q, ‖µ_P − µ_Q‖_H defines a metric. This quantity is known as the maximum mean discrepancy (MMD) (Gretton et al., 2012). With ‖f‖_H := √⟨f, f⟩_H and the reproducing property, it can be shown that ‖µ_P − µ_Q‖²_H = E_{x,x′∼P} k(x, x′) + E_{y,y′∼Q} k(y, y′) − 2 E_{x∼P, y∼Q} k(x, y), allowing the plug-in estimator to be used for estimating the MMD from empirical data. The MMD is an instance of the class of integral probability metrics (IPMs), and can equivalently be written as ‖µ_P − µ_Q‖_H = sup_{‖f‖_H ≤ 1} ∫ f d(P − Q), where the optimum f* is a witness function (Gretton et al., 2012; Sriperumbudur et al., 2012).

3 THEORY

We make the following assumption for the proof.

Assumption 3.1. l(θ, ·) is proper and upper semicontinuous. C is closed convex. ri(K_C) ≠ ∅.

This assumption is general in that it does not require the knowledge of the Lipschitz constant or of the RKHS in which l(θ, ·) lives. Generally speaking, the DRO problem (1) requires two essential elements: an appropriate ambiguity set that contains meaningful distributions and a sharp reformulation of the min-max problem. We first present the former in Section 3.1, and then the latter in Section 3.2. Complete proofs of our theory are deferred to the appendix.

3.1 Generalized primal formulation

We now present the primal formulation of kernel distributionally robust optimization (Kernel DRO) as a generalization of existing DRO frameworks:

$$(P) := \min_\theta\; \sup_{P,\,\mu} \left\{ \int l(\theta, \xi)\, dP(\xi) \;:\; \int \phi\, dP = \mu,\; P \in \mathcal{P},\; \mu \in C \right\}, \tag{2}$$

where H is an RKHS whose feature map is φ. Both sides of the constraint ∫ φ dP = µ are functions in H. Note µ can be viewed as a generalized moment vector, which is constrained to lie within the set C ⊆ H, referred to as an (RKHS) ambiguity set. Let us denote the set of all feasible distributions in (2) as K_C = {P : ∫ φ dP = µ, µ ∈ C, P ∈ P}, i.e., K_C is the usual ambiguity set. Intuitively, the set C restricts the RKHS embeddings of distributions in the ambiguity set K_C.
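To make the embedding machinery of Section 2.2 concrete before we construct such sets C, the following is a minimal numerical sketch of the plug-in MMD estimator described above, using a Gaussian kernel. The bandwidth, sample sizes, and the uniform toy distributions are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gram matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd_squared(X, Y, sigma=1.0):
    """Plug-in estimate of ||mu_P - mu_Q||_H^2 from samples X ~ P, Y ~ Q."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(200, 1))   # samples from the empirical distribution P-hat
Y = rng.uniform(-1.0, 1.0, size=(200, 1))   # samples from a shifted distribution
print(mmd_squared(X, Y, sigma=0.5))         # > 0: a universal (Gaussian) kernel separates P-hat from the shift
```

The same quantity, with X replaced by the empirical samples defining µ_P̂, is what the RKHS norm-ball sets below measure.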

(a) RKHS ambiguity sets C; (b) Interpretation of Kernel DRO

Figure 1: (a): Geometric intuition for choosing the ambiguity set C in H, such as a norm-ball, a polytope, and a Minkowski sum of sets. The scattered points are the embeddings of empirical samples. See Table 3 for more examples. (b): Geometric interpretation of Kernel DRO (4). The (red) curve depicts f₀ + f, which majorizes l(θ, ·) (black). The horizontal axis is ξ. The dashed lines denote the boundary of the domain X.

In this paper, we take a geometric perspective to construct C using convex sets in H. Given data samples {ξ_i}_{i=1}^N, we outline various choices for C in the left column of Table 1 (and Table 3 in the appendix), and illustrate our intuition in Figure 1.

To better understand our unifying formulation, let us examine the celebrated SVM through the lens of our generalized formulation.

Example 3.1 (SVM as generalized DRO). Let us consider SVM for regression, without using slack variables or regularization for simplicity. This can be formulated as optimizing the loss min_{f∈H} max_i [|y_i − f(x_i)| − η]_+, where η > 0 is the parameter for the hinge loss. This can be seen as the generalized DRO

$$\min_{f \in H}\; \sup_{P \in \mathcal{K}} \int [\,|y - f(x)| - \eta\,]_+\, dP(x, y),$$

where the ambiguity set is given by the polytope K = cl conv{δ_{ξ_1}, ..., δ_{ξ_N}}, ξ_i = (x_i, y_i).

Let us now consider a small RKHS to understand the effect of different RKHSs on Kernel DRO.

Example 3.2 (DRO with non-universal kernels). Consider distributions P̂ = N(0, 1), Q_v = N(0, v²) and H₁ induced by the linear kernel k₁(x, y) := xy. H₁ is small since it only contains linear functions. MMD_{k₁}(P̂, Q_v) = 0 for all v ≠ 0 since they share the first moment. Therefore, any C that contains µ_P̂ also contains all the distributions in {µ_{Q_v}, v ≠ 0}.

This example shows that small RKHSs force Kernel DRO to robustify against a large set of distributions, resulting in conservativeness. In the extreme, if we choose the smallest possible RKHS H = {0}, then the space does not contain functions to separate any distinct distributions. This renders Kernel DRO (4) overly conservative since we can only choose f = 0 in (4) — it is precisely reduced to worst-case RO. On the other extreme, the next example shows the downside of function spaces that are too large.

Example 3.3 (DRO with large function space). Suppose H is the space of all bounded measurable functions; the metric induced by H becomes the total variation distance (Sriperumbudur et al., 2011). While the induced topology is strong, (4) has a trivial solution f = l, f₀ = 0. By plugging this solution into (4), we recover (2). Hence the reformulation becomes meaningless.

We distinguish between DRO without metrics, e.g., moment constraints and SVMs, and DRO with a probability metric or divergence, e.g., the Wasserstein metric. Let us first examine an instance of the former using Kernel DRO. We return to the latter at the end of this section.

Example 3.4 (Reduction to DRO with moment constraints). Kernel DRO with the second-order polynomial kernel k₂(x, y) := (1 + xᵀy)² and a singleton ambiguity set C = {µ_P̂} robustifies against all distributions sharing the first two moments with P̂. This is equivalent to DRO with known first two moments, such as in (Delage and Ye, 2010; Scarf, 1958). More generally, the choice of the pth-order polynomial kernel k_p(x, y) := (1 + xᵀy)^p corresponds to DRO with known first p moments.

If H is associated with a universal kernel (e.g., Gaussian), it is large since H is dense in the space of continuous functions (cf. (Sriperumbudur et al., 2011)). Then the induced topology (MMD) is strong enough to separate distinct probability measures. Meanwhile, RKHSs allow for efficient computation using tools from kernel methods, as shown in Section 4. Therefore, our insight is that universal RKHSs are large enough for DRO applications.

Remark. For the RKHS associated with the Gaussian kernel, the diameter of the space can be computed: for all p, q, ‖µ_p − µ_q‖_H ≤ ‖µ_p‖_H + ‖µ_q‖_H ≤ 2√(sup_{x,y} k(x, y)) = 2. Hence, if ε ≥ 2, C contains all probability distributions. Then, Kernel DRO is again reduced to worst-case RO on the domain X.
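The following toy check illustrates Example 3.2 and the universal-kernel discussion numerically, under the assumption that finite-sample estimates suffice: the linear-kernel MMD between samples of N(0, 1) and N(0, 4) is close to zero, while a Gaussian (universal) kernel separates the two distributions. The sample sizes and bandwidth are arbitrary illustrative choices.

```python
import numpy as np

def mmd_squared(X, Y, kernel):
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

linear = lambda X, Y: X @ Y.T                    # k_1(x, y) = x * y
def gauss(X, Y, s=1.0):
    d = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d / (2 * s**2))

rng = np.random.default_rng(0)
P = rng.normal(0.0, 1.0, size=(5000, 1))         # N(0, 1)
Q = rng.normal(0.0, 2.0, size=(5000, 1))         # N(0, 4): same first moment as P
print(mmd_squared(P, Q, linear))                 # ~ 0: the linear-kernel RKHS cannot tell them apart
print(mmd_squared(P, Q, gauss))                  # clearly positive: a universal kernel separates them
```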

Table 1: Examples of support functions for Kernel DRO. See Table 3 for more details.

RKHS ambiguity set C — Support function δ*_C(f)

- RKHS norm-ball: C = {µ : ‖µ − µ_P̂‖_H ≤ ε} — δ*_C(f) = (1/N) Σ_{i=1}^N f(ξ_i) + ε ‖f‖_H
- Polytope (scenario optimization; SVMs with no slack): C = conv{φ(ξ_1), ..., φ(ξ_N)} — δ*_C(f) = max_i f(ξ_i)
- Minkowski sum: C = Σ_{i=1}^N C_i — δ*_C(f) = Σ_{i=1}^N δ*_{C_i}(f)
- Whole space (worst-case RO (Ben-Tal et al., 2009)): C = H — δ*_C(f) = 0 if f = 0, ∞ otherwise
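As a sanity check of Table 1, the sketch below evaluates the support functions of the norm-ball and polytope sets for an f given by a kernel expansion f(·) = Σ_j α_j k(·, z_j). The expansion points, bandwidth, and ε are hypothetical values chosen only for illustration.

```python
import numpy as np

def gauss_gram(X, Y, s=1.0):
    d = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d / (2 * s**2))

# f is represented by a kernel expansion f(.) = sum_j alpha_j k(., z_j)
rng = np.random.default_rng(0)
xi = rng.uniform(-0.5, 0.5, size=(10, 1))            # empirical samples defining P-hat
z = np.linspace(-1, 1, 20).reshape(-1, 1)            # expansion points
alpha = rng.normal(size=20)

f_at_xi = gauss_gram(xi, z) @ alpha                  # f(xi_i), via the reproducing property
f_norm = np.sqrt(alpha @ gauss_gram(z, z) @ alpha)   # ||f||_H = sqrt(alpha^T K alpha)

eps = 0.1
support_norm_ball = f_at_xi.mean() + eps * f_norm    # row 1 of Table 1
support_polytope = f_at_xi.max()                     # row 2 of Table 1
print(support_norm_ball, support_polytope)
```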

We now turn to DRO with a generalized class of integral probability metrics (IPMs).

Example 3.5 (Generalization to IPM-DRO). Suppose d_F(P, P̂) := sup_{f∈F} ∫ f d(P − P̂) is the IPM defined by some function class F. The IPM-DRO primal formulation is given by

$$\min_\theta\; \sup_{d_{\mathcal{F}}(P, \hat{P}) \le \epsilon} \int l(\theta, \xi)\, dP(\xi). \tag{3}$$

If we choose the class F = {f : ‖f‖_H ≤ 1}, we recover Kernel DRO with the RKHS norm-ball set in Table 1. Similarly, F = {f : lip(f) ≤ 1} recovers the (type-1) Wasserstein-DRO. This puts Wasserstein-DRO and Kernel DRO into a unified perspective.

3.2 Generalized duality theorem

We now present the main theorem of this paper, the generalized duality theorem of Kernel DRO (2).

Theorem 3.1 (Generalized Duality). Under Assumption 3.1, (2) is equivalent to

$$(D) := \min_{\theta,\, f_0 \in \mathbb{R},\, f \in H}\; f_0 + \delta_C^*(f) \quad \text{subject to} \quad l(\theta, \xi) \le f_0 + f(\xi) \;\; \forall \xi \in \mathcal{X}, \tag{4}$$

where δ*_C(f) := sup_{µ∈C} ⟨f, µ⟩_H is the support function of C; i.e., (P) = (D), and strong duality holds for the inner moment problem for any θ point-wise.

The theorem holds regardless of the dependency of l on θ, e.g., non-convexity. If l is convex in θ, then (4) is a convex program. Formulation (4) has a clear geometric interpretation: we find a function f₀ + f that majorizes l(θ, ·) and subsequently minimize a surrogate loss involving f₀ and f. This is illustrated in Figure 1b. Note the term duality here refers to the inner moment problem. The statement can be further simplified by replacing f₀ + f with f. However, we choose the current notation for the sake of its explicit connection to RO.

Proof sketch. Our weak duality proof follows standard paradigms of Lagrangian relaxation by introducing dual variables. Notably, we associate the functional constraint ∫ φ dP = µ with a dual function f ∈ H, which is the decision variable in the dual problem (4). Using the reproducing property of RKHSs and conic duality, we arrive at (4) with weak duality. Our strong duality proof is an extension of the conic strong duality in Euclidean spaces. We rely on the existence of separating hyperplanes between convex sets in locally convex function spaces, e.g., H. See the illustration in Figure 2. In our generalized duality theorem, this separating hyperplane is determined by the witness function f*, which is the optimal dual variable in (4). See the appendix for the full proof.

Figure 2: Illustration of a separating hyperplane in H.

Theorem 3.1 generalizes the classical bounds in generalized moment problems (Isii, 1962; Lasserre, 2002; Shapiro, 2001; Bertsimas and Popescu, 2005; Popescu, 2005; Vandenberghe et al., 2007; Van Parys et al., 2016) to infinitely many moments using RKHSs. A distinction between Theorem 3.1 and other DRO approaches is that it uses the density of universal RKHSs to find a surrogate which can sharply bound the worst-case risk. This means that we do not require the loss l(θ, ·) to be affine, quadratic, or living in a known RKHS, nor do we require the knowledge of the Lipschitz constant or RKHS norm of l(θ, ·). To our knowledge, existing works typically require one of such assumptions.

Moreover, Theorem 3.1 generalizes existing RO and DRO in the sense that it gives us a unifying tool to work with various ambiguity sets, which may be customized for specific applications. We outline a few closed-form expressions of the support function δ*_C(f) in Table 1, and more in Table 3. We now return to IPM-DRO with a duality result.

Corollary 3.1.1 (IPM-DRO duality). Given the integral probability metric d_F(P, P̂) := sup_{f∈F} ∫ f d(P − P̂), a dual program to (3) is given by

$$\min_{\theta,\, \lambda \ge 0,\, f_0 \in \mathbb{R},\, f \in \mathcal{F}}\; f_0 + \frac{1}{N}\sum_{i=1}^N \lambda f(\xi_i) + \lambda\epsilon \quad \text{subject to} \quad l(\theta, \xi) \le f_0 + \lambda f(\xi) \;\; \forall \xi \in \mathcal{X}. \tag{5}$$

The reduction to (4) as a special case can be seen by replacing λf with f and choosing F = H.

We now establish an explicit connection between DRO and stochastic optimization with an expectation constraint, whose solution methods using stochastic approximation are a topic of active research (Lan and Zhou, 2020; Xu, 2020).

Corollary 3.1.2 (Stochastic optimization with expectation constraint). Under Assumption 3.1, the optimal value of (2) coincides with that of

$$(d) := \min_{\theta,\, f_0 \in \mathbb{R},\, f \in H}\; f_0 + \delta_C^*(f) \quad \text{subject to} \quad \mathbb{E}\, h\!\left(l(\theta, \zeta) - f_0 - \lambda f(\zeta)\right) \le 0, \tag{6}$$

for some function h that satisfies h(t) = 0 if t ≤ 0, h(t) > 0 if t > 0, and ζ ∼ µ whose probability measure places positive mass on any non-empty open subset of X, i.e., µ(B) > 0 for all open B ⊆ X, B ≠ ∅.

A choice for h is h(·) = [·]_+, which is used in the conditional value-at-risk (Rockafellar and Uryasev, 2000). We will see the computational implication of Corollary 3.1.2 in Section 4.

We now establish further theoretical results as a consequence of the generalized duality theorem to help us understand the geometric intuition of how Kernel DRO works. By the weak duality (P) ≤ (D) of (11) and (12), we have ∫ l dP ≤ f₀ + δ*_C(f). Specifically, if C is the RKHS norm-ball in Table 1, this inequality becomes ∫ l dP ≤ f₀ + (1/N) Σ_{i=1}^N f(ξ_i) + ε ‖f‖_H. Its right-hand side can be seen as a computable bound for the worst-case risk when generalizing to P. This may be useful when the Lipschitz constant of l is not known or hard to obtain, as is often the case in practice. The following insight is a consequence of a generalization of the classical complementarity condition of convex optimization; see the appendix.

Corollary 3.1.3 (Interpolation property). Given θ, let P*, f*, f₀* be a set of optimal primal-dual solutions associated with (P) and (D). Then l(θ, ξ) = f₀* + f*(ξ) holds P*-almost everywhere.

Intuitively, this result states that f₀* + f* interpolates the loss l(θ, ·) at the support points of P*. This is illustrated in Figure 1 (b) and later empirically validated in Figure 3. We can also see that the size of the RKHS H matters since, if H is small (e.g., H = {0}), f₀* + f* cannot interpolate the loss l well. On the other hand, the density of universal RKHSs allows the interpolation of general loss functions.
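The following toy sketch illustrates the computable bound f₀ + (1/N) Σ_i f(ξ_i) + ε‖f‖_H discussed above. It assumes a hand-constructed majorizer f₀ + f that is only verified on a finite grid (so the resulting number is an approximate certificate rather than the exact dual value), with a quadratic toy loss and illustrative parameters.

```python
import numpy as np

# Toy setup: loss l(theta, xi) = (theta - xi)^2 on X = [-1, 1], theta fixed.
theta = 0.3
loss = lambda xi: (theta - xi) ** 2

def gauss_gram(X, Y, s=0.5):
    d = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d / (2 * s**2))

# Build a crude majorizer: kernel-interpolate l at a few points, then shift it
# upward until l <= f0 + f holds on a fine grid.
grid = np.linspace(-1, 1, 401).reshape(-1, 1)
z = np.linspace(-1, 1, 15).reshape(-1, 1)
K = gauss_gram(z, z)
alpha = np.linalg.solve(K + 1e-6 * np.eye(len(z)), loss(z[:, 0]))  # kernel interpolation of l (with jitter)
f_grid = gauss_gram(grid, z) @ alpha
f0 = float(np.max(loss(grid[:, 0]) - f_grid))      # shift so that l <= f0 + f on the grid

# Computable upper bound on the worst-case risk for the eps-norm-ball set:
xi = np.random.default_rng(0).uniform(-0.5, 0.5, size=(10, 1))
eps = 0.1
bound = f0 + (gauss_gram(xi, z) @ alpha).mean() + eps * np.sqrt(alpha @ K @ alpha)
print(bound)
```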

It is tempting to approximately solve (4) by relaxing the constraint to hold only for the empirical samples, i.e., l(θ, ξ_i) ≤ f₀ + f(ξ_i), i = 1, ..., N. The following observation cautions us against this.

Example 3.6 (Counterexample: relaxation of the semi-infinite constraint). Let H be a Gaussian RKHS with the bandwidth σ = √2. Suppose our data set is {0} and the ambiguity set is C := {µ : ‖µ − φ(0)‖_H ≤ ε}. Let ε = √(2 − 2/e). We consider the loss function l(ξ) = [|θ + ξ| − 1]_+ and relax the constraint of (4) to hold only at the empirical sample, i.e.,

$$\min_{\theta,\, f \in H,\, f_0 \in \mathbb{R}}\; f_0 + f(0) + \epsilon\|f\|_H \quad \text{subject to} \quad [\,|\theta| - 1\,]_+ \le f_0 + f(0),$$

which admits an optimal solution θ* = 0, f* = 0, f₀* = 0 and the worst-case risk (d) = 0. However, let µ_{P′} = ½ φ(0) + ½ φ(2). It is straightforward to verify P′ ∈ K_C and ∫ l(θ*, ξ) dP′(ξ) = 1/2 > (d), i.e., the solution θ* is not robust against P′.

4 COMPUTATION

Given a certain parametrization of the RKHS function f, (4) is a semi-infinite program (SIP) (Guerra Vázquez et al., 2008). In the following, we propose two computational methods that do not require a polynomial loss l or the knowledge of its Lipschitz constant. For simplicity, we only derive the formulations for the RKHS-norm-ball ambiguity set, while other formulations are given in Tables 1 and 3.

A batch approach by discretization of the SIP. We first consider an approach based on the discretization method of SIP (Guerra Vázquez et al., 2008). Let us consider an ambiguity set smaller than the RKHS-norm ball: that of distributions supported on some {ζ_j}_{j=1}^M ⊆ X. Then it suffices to consider the following program, which relaxes the constraint of (4) to finite support:

$$\min_{\theta,\, f \in H,\, f_0 \in \mathbb{R}}\; f_0 + \frac{1}{N}\sum_{i=1}^N f(\xi_i) + \epsilon\|f\|_H \quad \text{subject to} \quad l(\theta, \zeta_j) \le f(\zeta_j) + f_0, \;\; j = 1, \dots, M. \tag{7}$$

Note (7) is a convex program if l(θ, ξ) is convex in θ.

We can parametrize the RKHS function f by a wealth of tools from kernel methods, such as the random features f̂(ξ) = wᵀφ̂(ξ) for large-scale problems (Rahimi and Recht, 2008). Alternatively, for small problems, we can parametrize f by a kernel expansion on the points ζ_i. We provide concrete plug-in forms in the appendix.
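A minimal CVXPY sketch of the discretized program (7) is given below, instantiated for the uncertain least-squares loss of Section 5 and with f parametrized by a kernel expansion on the support points ζ_j. The problem sizes, bandwidth, and ε are illustrative assumptions; this is a sketch rather than the authors' released implementation (see the repository linked in Section 5 for that).

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
d, N, M, eps, sigma = 5, 10, 30, 0.1, 0.5
A0, A1, b = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
xi = rng.uniform(-0.5, 0.5, N)                    # empirical samples
zeta = np.linspace(-1.0, 1.0, M)                  # discretization / support points in X

k = lambda x, y: np.exp(-(x[:, None] - y[None, :]) ** 2 / (2 * sigma**2))
K_zz = k(zeta, zeta) + 1e-8 * np.eye(M)           # Gram matrix on the support points
K_xz = k(xi, zeta)
L = np.linalg.cholesky(K_zz)                      # so that ||f||_H = ||L.T @ alpha||_2

theta = cp.Variable(d)
alpha = cp.Variable(M)                            # f(.) = sum_j alpha_j k(., zeta_j)
f0 = cp.Variable()

f_xi = K_xz @ alpha                               # f evaluated at the empirical samples
f_zeta = K_zz @ alpha                             # f evaluated at the support points
objective = f0 + cp.sum(f_xi) / N + eps * cp.norm(L.T @ alpha, 2)
constraints = [cp.sum_squares((A0 + zeta[j] * A1) @ theta - b) <= f_zeta[j] + f0
               for j in range(M)]                 # l(theta, zeta_j) <= f(zeta_j) + f0
prob = cp.Problem(cp.Minimize(objective), constraints)
prob.solve()
print(theta.value, prob.value)
```

Any parametrization that keeps f(ξ_i), f(ζ_j), and ‖f‖_H convex in the parameters (e.g., random features with the surrogate norm ‖w‖₂) can be plugged in the same way.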

As an interesting by-product of (7), let us derive an unconstrained version of (7), which gives rise to a generalized risk measure that we term the kernel conditional value-at-risk (Kernel CVaR).

Example 4.1 (Kernel CVaR).

$$\text{K-CVaR}_\alpha(X) := \inf_{f \in H,\, f_0 \in \mathbb{R}}\; \frac{1}{\alpha M}\sum_{j=1}^M [\,X - f(\zeta_j) - f_0\,]_+ \; + \; f_0 + \frac{1}{N}\sum_{i=1}^N f(\xi_i) + \epsilon\|f\|_H. \tag{8}$$

If f = 0 and {ζ_j}_{j=1}^M = {ξ_i}_{i=1}^N, then (8) is reduced to the classical CVaR (Rockafellar and Uryasev, 2000).

Program (7) can be readily solved using off-the-shelf convex solvers. However, to scale up to large data sets, we next develop a stochastic approximation (SA) method for Kernel DRO.

Stochastic functional gradient DRO. We now present our SA approach enabled by Theorem 3.1 by employing two key tools: 1) scalable approximate RKHS features, such as random Fourier features (Rahimi and Recht, 2008; Dai et al., 2014; Carratino et al., 2018), and 2) stochastic approximation with semi-infinite and expectation constraints (Tadić et al., 2006; Lan and Zhou, 2020; Baes et al., 2011; Xu, 2020).

Let us summon Corollary 3.1.2 to formulate a stochastic program with expectation constraint:

$$\min_{\theta,\, f_0 \in \mathbb{R},\, f \in H}\; f_0 + \frac{1}{N}\sum_{i=1}^N f(\xi_i) + \epsilon\|f\|_H \quad \text{subject to} \quad \mathbb{E}\,[\,l(\theta, \zeta) - f_0 - \lambda f(\zeta)\,]_+ \le 0, \tag{9}$$

where ζ follows a certain proposal distribution on X, e.g., uniform or by adaptive sampling. An alternative is to directly solve (4) using SA techniques with SI constraints, such as (Tadić et al., 2006; Wei et al., 2020). (4), (6), and (9) are all convex in the function f. We can compute the functional gradient by

$$\nabla_f f = \phi, \qquad \nabla_f \|f\|_H = \frac{f}{\|f\|_H}. \tag{10}$$

When used with approximate features of the form f̂(ξ) = wᵀφ̂(ξ), we further have ∇_w f̂(ξ) = φ̂(ξ), ∇_w ‖f̂‖_H = w/‖w‖₂. We outline our stochastic functional gradient DRO (SFG-DRO) in Algorithm 1.

Algorithm 1 Stochastic Functional Gradient DRO (SFG-DRO)
1: for k = 1, 2, ... do
2:   Sample mini-batch data {ξ_i^k} := {x_i, y_i}. Sample {ζ_i} from some proposal distribution.
3:   Approximate f, e.g., by the random Fourier feature f̂(ξ_i^k) = wᵀφ̂(ξ_i^k).
4:   Estimate the stochastic functional gradient of the objective and constraint in (9) using (10).
5:   Update θ, f₀, f using the functional gradient with any SA routine with expectation or semi-infinite constraints, e.g., (Lan and Zhou, 2020; Xu, 2020; Tadić et al., 2006; Wei et al., 2020).

Compared with many batch-setting DRO approaches, SFG-DRO can be used with general model classes, such as neural networks, and is applicable to a broad class of optimization and modern learning tasks. The convergence guarantee follows that of the specific SA routine used in Step 5 of the algorithm. It is worth noting that, when used with a primal SA approach such as (Lan and Zhou, 2020), SFG-DRO completely operates in the dual space (an RKHS) since Kernel DRO (4) is based on the generalized duality Theorem 3.1. This interplay between the primal (measures) and dual (functions) is the essence of our theory.
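The sketch below is a schematic rendering of one possible instantiation of Algorithm 1: it uses random Fourier features for f̂ and a crude penalty-based SGD update in place of the constrained stochastic-approximation routines cited in Step 5. The toy loss, proposal distribution, penalty weight ρ, and step size are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
D, sigma, eps, rho, lr = 200, 0.5, 0.1, 10.0, 0.01
W = rng.normal(0.0, 1.0 / sigma, size=D)          # random Fourier frequencies for a Gaussian kernel
b = rng.uniform(0.0, 2 * np.pi, size=D)
phi = lambda x: np.sqrt(2.0 / D) * np.cos(np.outer(x, W) + b)   # approximate feature map, shape (n, D)

loss = lambda theta, z: (theta - z) ** 2           # toy convex loss l(theta, zeta)
dloss_dtheta = lambda theta, z: 2 * (theta - z)

theta, f0, w = 0.0, 0.0, np.zeros(D)
data = rng.uniform(-0.5, 0.5, size=1000)           # training samples xi ~ P-hat

for k in range(2000):
    xi = rng.choice(data, size=32)                 # Step 2: mini-batch and proposal samples
    zeta = rng.uniform(-1.0, 1.0, size=32)
    Phi_xi, Phi_z = phi(xi), phi(zeta)             # Step 3: approximate f via random features

    # Step 4: stochastic (sub)gradients of the objective f0 + mean f(xi) + eps * ||f||
    g_w = Phi_xi.mean(axis=0) + eps * w / (np.linalg.norm(w) + 1e-12)
    g_f0, g_theta = 1.0, 0.0

    # ... and of the penalized constraint mean[ l(theta, zeta) - f0 - f(zeta) ]_+
    slack = loss(theta, zeta) - f0 - Phi_z @ w
    active = slack > 0
    g_theta += rho * np.mean(active * dloss_dtheta(theta, zeta))
    g_f0 += rho * -np.mean(active)
    g_w += rho * -(Phi_z[active].sum(axis=0) / len(zeta))

    # Step 5: plain SGD update (a stand-in for the constrained-SA routines cited in the paper)
    theta, f0, w = theta - lr * g_theta, f0 - lr * g_f0, w - lr * g_w
```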

5 NUMERICAL STUDIES

This section showcases the applicability of Kernel DRO (and hence SFG-DRO) and discusses the robustness-optimality trade-off. Our purpose is not to benchmark state-of-the-art performances or to demonstrate the superiority of a specific algorithm. Indeed, we believe both RO and DRO are elegant theoretical frameworks that have their specific use cases. We note that our theory can be applied to a broader scope of applications than the examples here, such as stochastic optimal control. See the appendix for more experimental results. The code is available at https://github.com/jj-zhu/kdro.

Distributionally robust solution to uncertain least squares. We first consider a robust least squares problem adapted from (El Ghaoui and Lebret, 1997), which historically demonstrated an important application of RO to statistical learning. (See also (Boyd et al., 2004, Ch. 6.4).) The task is to minimize the objective ‖Aθ − b‖²₂ w.r.t. θ. A is modeled by A(ξ) = A₀ + ξA₁, where ξ ∈ X is uncertain, X = [−1, 1], and A₀, A₁ ∈ R^{10×10}, b ∈ R^{10} are given. We compare Kernel DRO against (a) empirical risk minimization (ERM; also known as sample average approximation), which minimizes (1/N) Σ_{i=1}^N ‖A(ξ_i)θ − b‖²₂, and (b) worst-case RO via the SDP from (El Ghaoui and Lebret, 1997). We consider a data-driven setting with given samples {ξ_i}_{i=1}^N and the Kernel DRO formulation min_θ max_{P∈P, µ∈C} E_{ξ∼P} ‖A(ξ)θ − b‖²₂ subject to ∫ φ dP = µ, where we choose the ambiguity set to be the ε-norm-ball in the RKHS (Table 1). Empirical samples {ξ_i}_{i=1}^N (N = 10) are generated uniformly from [−0.5, 0.5].

(a) Uncertain least squares loss (b) Geometric interpretation (c) MNIST classification error

(d) (Left) unperturbed data (Center) ERM classification result (red indicates errors) (Right) SFG-DRO (Kernel DRO)

Figure 3: Uncertain least squares. (a) This plot depicts the test loss of the algorithms. All error bars are in standard error. We ran 10 independent trials. In each trial, we solved Kernel DRO to obtain θ* and tested it on a test dataset of 500 samples. We then vary the perturbation ∆ from 0 to 4. (b) (red) is the dual optimal solution f₀* + f*. (black) is the function l(θ*, ·). The pink bars depict a worst-case distribution while the blue bars the empirical distribution. We can observe that f₀* + f* touches the loss l(θ*, ·) at the support of the worst-case distribution P* (pink dots). Note f* (normalized) can be viewed as a witness function of the two distributions. Classification under perturbation: (c) We plot the classification error rate during test time. The x-axis is the perturbation magnitude allowed on the test data. For ERM, PGD, and SFG-DRO (Kernel DRO), we train 5 independent models. Each model is tested on 500 randomly sampled images. (d) We visualize the predictions of ERM and SFG-DRO on the perturbed images with perturbation magnitude ∆ = 0.2. Blue frames indicate correct predictions while the red ones indicate errors.

We then apply the Kernel DRO formulation (7). To test the solution, we create a distribution shift by generating test samples from [−0.5·(1 + ∆), 0.5·(1 + ∆)], where ∆ is a perturbation varying within [0, 4]. Figure 3a shows this comparison. As the perturbation increases, ERM quickly loses robustness. On the other hand, RO is the most robust, with the trade-off of being conservative. As expected, Kernel DRO achieves some level of optimality while retaining robustness.

We then ran Kernel DRO with fewer empirical samples (N = 5) to show the geometric interpretations. We plot the optimal dual solution f₀* + f* in Figure 3b. Recall it is an over-estimator of the loss l(θ, ·). We solve the inner moment problem (see appendix) to obtain a worst-case distribution P*. Comparing P* with P̂, we can observe the adversarial behavior of the worst-case distribution. See the caption for more description. From Figure 3b, we can see that the intuition of Kernel DRO is to flatten the loss curve using a smooth function.

Distributionally robust learning under adversarial perturbation. We now demonstrate the framework of SFG-DRO in Algorithm 1 in a non-convex setting. For simplicity, we consider an MNIST binary classification task with a two-layer neural network. We emphasize that the deliberate choice of this simple architecture ablates factors known to implicitly influence robustness, such as regularization and dropout. The data set contains images of zero and one (i.e., two classes). Each image x is represented by x ∈ [0, 1]^{28×28}. The test data is perturbed by an unknown disturbance, i.e., x̃_test := x + δ, where x ∼ P_test is the unperturbed test data and δ is the perturbation. In the plots, δ is generated by the PGD algorithm (Madry et al., 2019), using projected gradient descent to find the worst-case perturbation within a box {δ : ‖δ‖_∞ ≤ ∆}. We compared SFG-DRO (Kernel DRO) with ERM and PGD (cf. (Madry et al., 2019; Kolter and Madry)). Note the overall loss of PGD is an average loss instead of a worst-case one. Hence it is already less conservative than RO. We train a classification model g_θ: x ↦ y using SFG-DRO in Algorithm 1, with the SA subroutine of (Lan and Zhou, 2020). During training, we set the ambiguity size of SFG-DRO as ε = 0.5 and the domain X to be norm-balls around the training data, X = {ζ = X + δ : ‖δ‖_∞ ≤ 0.5}, where X is the training data.

Figure 3d (left) plots unperturbed test samples. Figure 3c shows the classification error rate as we increase the magnitude of the perturbation ∆. We observe that ERM performs well when there is no test-time perturbation but quickly underperforms as the noise level increases. PGD is the most robust under large perturbation, but has the worst nominal performance. SFG-DRO possesses improved robustness while its performance under no perturbation does not become much worse. This is consistent with our theoretical insights into RO and DRO.

6 OTHER RELATED WORK AND DISCUSSION

This paper uses similar techniques of reformulating min-max programs as in (Ben-Tal et al., 2015; Bertsimas et al., 2017), but our ambiguity set is constructed in an RKHS. Duchi et al. (2020) proposed variational approximations to marginal DRO to treat covariate shift in supervised learning. The authors of (Zhu et al., 2020) used kernel mean embedding for the inner moment problem (but not DRO) and proved the statistical consistency of the solution. The work of Staib and Jegelka (2019) used insights from DRO to motivate a regularizer for kernel ridge regression. DRO has also been applied to Bayesian optimization in (Rontsis et al., 2020; Kirschner et al., 2020), where the latter work used MMD ambiguity sets of distributions over discrete spaces. In terms of scalability, recent works such as (Sinha et al., 2020; Li et al., 2019; Namkoong and Duchi, 2016) also explored DRO for modern machine learning tasks. To the best of our knowledge, no existing work contains results such as the generalized ambiguity set constructions in Tables 1 and 3, the generalized duality theory underpinned by Theorem 3.1, or the stochastic functional gradient algorithm SFG-DRO.

In summary, this paper proves Theorem 3.1, which generalizes the classical duality theory in the literature of the mathematical problem of moments and DRO. Using the density of universal RKHSs, the dual bound in Theorem 3.1 is sharp while lifting restrictions on the loss function class. The generalized primal formulations shed light on the connection between Kernel DRO and existing robust and stochastic optimization approaches. Finally, the proposed stochastic approximation algorithm SFG-DRO enables the applications of Kernel DRO to modern learning tasks.

The compactness assumption on X can be further relaxed, as universality can be extended to non-compact domains (Sriperumbudur et al., 2011). In the special case of RKHS-norm-ball ambiguity sets, choosing the size ε can be motivated using kernel statistical testing (Gretton et al., 2012). However, when DRO is used in the setting where test distributions are perturbed, as in our examples, existing statistical guarantees in the literature for unperturbed settings cannot be directly applied. This is a topic of future work.

Acknowledgements

We thank Daniel Kuhn for the helpful discussion during a workshop at IPAM, UCLA. We also thank Yassine Nemmour and Simon Buchholz for sending us their feedback on the paper draft. During part of this project, Jia-Jie Zhu was supported by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 798321. Moritz Diehl would like to acknowledge the funding support from the DFG via Project DI 905/3-1 (on robust model predictive control).

References

Michel Baes, Michael Bürgisser, and Arkadi Nemirovski. A randomized Mirror-Prox method for solving structured large-scale matrix saddle-point problems. arXiv:1112.1274 [math], 2011.

Alexander Barvinok. A Course in Convexity, volume 54. American Mathematical Society, 2002.

Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization, volume 28. Princeton University Press, 2009.

Aharon Ben-Tal, Dick den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust Solutions of Optimization Problems Affected by Uncertain Probabilities. Management Science, 59(2):341–357, 2013.

Aharon Ben-Tal, Dick den Hertog, and Jean-Philippe Vial. Deriving robust counterparts of nonlinear uncertain inequalities. Mathematical Programming, 149(1):265–299, 2015.

Dimitris Bertsimas and Ioana Popescu. Optimal Inequalities in Probability Theory: A Convex Optimization Approach. SIAM Journal on Optimization, 15(3):780–804, 2005.

Dimitris Bertsimas, Nathan Kallus, and Vishal Gupta. Data-Driven Robust Optimization. Springer Berlin Heidelberg, 2017.

Alberto Bietti, Grégoire Mialon, Dexiong Chen, and Julien Mairal. A Kernel Perspective for Regularizing Deep Neural Networks. arXiv:1810.00363 [cs, stat], 2019.

Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust Wasserstein Profile Inference and Applications to Machine Learning. Journal of Applied Probability, 56(3):830–857, 2019.

Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152, 1992.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

G. C. Calafiore and M. C. Campi. The scenario approach to robust control design. IEEE Transactions on Automatic Control, 51(5):742–753, 2006.

Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with SGD and Random Features. In Advances in Neural Information Processing Systems 31, pages 10192–10203. Curran Associates, Inc., 2018.

Andreas Christmann and Ingo Steinwart. Consistency and robustness of kernel-based regression in convex risk minimization. Bernoulli, 13(3):799–819, 2007.

John B. Conway. A Course in Functional Analysis, volume 96. Springer, 2019.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F. Balcan, and Le Song. Scalable Kernel Methods via Doubly Stochastic Gradients. In Advances in Neural Information Processing Systems 27, pages 3041–3049. Curran Associates, Inc., 2014.

Erick Delage and Yinyu Ye. Distributionally Robust Optimization Under Moment Uncertainty with Application to Data-Driven Problems. Operations Research, 58(3):595–612, 2010.

John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv:1610.03425, 2016.

John Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally Robust Losses for Latent Covariate Mixtures. arXiv:2007.13982 [cs, stat], 2020.

Laurent El Ghaoui and Hervé Lebret. Robust Solutions to Least-Squares Problems with Uncertain Data. SIAM Journal on Matrix Analysis and Applications, 18(4):1035–1064, 1997.

Rui Gao and Anton J. Kleywegt. Distributionally Robust Stochastic Optimization with Wasserstein Distance. arXiv:1604.02199 [math], 2016.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

F. Guerra Vázquez, J. J. Rückmann, O. Stein, and G. Still. Generalized semi-infinite programming: A tutorial. Journal of Computational and Applied Mathematics, 217(2):394–419, 2008.

Keiiti Isii. On sharpness of Tchebycheff-type inequalities. Annals of the Institute of Statistical Mathematics, 14(1):185–197, 1962.

Garud N. Iyengar. Robust Dynamic Programming. Mathematics of Operations Research, 30(2):257–280, 2005.

Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka, and Andreas Krause. Distributionally Robust Bayesian Optimization. arXiv:2002.09038 [cs, stat], 2020.

Guanghui Lan and Zhiqiang Zhou. Algorithms for stochastic optimization with function or expectation constraints. Computational Optimization and Applications, 2020.

Jean B. Lasserre. Bounds on measures satisfying moment conditions. The Annals of Applied Probability, 12(3):1114–1137, 2002.

Jiajin Li, Sen Huang, and Anthony Man-Cho So. A First-Order Algorithmic Framework for Wasserstein Distributionally Robust Logistic Regression. arXiv:1910.12778 [cs, math, stat], 2019.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv:1706.06083 [cs, stat], 2019.

Zico Kolter and Aleksander Madry. Adversarial Robustness – Theory and Practice. http://adversarial-ml-tutorial.org/.

Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115–166, 2018.

Hongseok Namkoong and John C. Duchi. Stochastic Gradient Methods for Distributionally Robust Optimization with f-divergences. In Advances in Neural Information Processing Systems 29, pages 2208–2216. Curran Associates, Inc., 2016.

Arnab Nilim and Laurent El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices. Operations Research, 53(5):780–798, 2005.

Imre Pólik and Tamás Terlaky. A Survey of the S-Lemma. SIAM Review, 49(3):371–418, 2007.

Ioana Popescu. A Semidefinite Programming Approach to Optimal-Moment Bounds for Convex Classes of Distributions. Mathematics of Operations Research, 30(3):632–657, 2005.

Ali Rahimi and Benjamin Recht. Random Features for Large-Scale Kernel Machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.

R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.

R. Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.

W. W. Rogosinski. Moments of Non-Negative Mass.

Nikitas Rontsis, Michael A. Osborne, and Paul J. Goulart. Distributionally Ambiguous Optimization for Batch Bayesian Optimization. Journal of Machine Learning Research, 21(149):1–26, 2020.

Herbert Scarf. A min-max solution of an inventory problem. Studies in the Mathematical Theory of Inventory and Production, 1958.

B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Annual Conference on Computational Learning Theory, number 2111 in Lecture Notes in Computer Science, pages 416–426. Springer, 2001.

Bernhard Schölkopf, Alex J. Smola, Robert C. Williamson, and Peter L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.

Alexander Shapiro. On Duality Theory of Conic Linear Problems. In Panos Pardalos, Miguel Á. Goberna, and Marco A. López, editors, Semi-Infinite Programming, volume 57, pages 135–165. Springer US, Boston, MA, 2001.

Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2014.

Aman Sinha, Hongseok Namkoong, Riccardo Volpi, and John Duchi. Certifying Some Distributional Robustness with Principled Adversarial Training. arXiv:1710.10571 [cs, stat], 2020.

Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.

Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert R. G. Lanckriet. Universality, Characteristic Kernels and RKHS Embedding of Measures. Journal of Machine Learning Research, 12:2389–2410, 2011.

Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.

Matthew Staib and Stefanie Jegelka. Distributionally robust optimization and generalization in kernel methods. In Advances in Neural Information Processing Systems, pages 9131–9141, 2019.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

Vladislav B. Tadić, Sean P. Meyn, and Roberto Tempo. Randomized Algorithms for Semi-Infinite Programming Problems. In Giuseppe Calafiore and Fabrizio Dabbene, editors, Probabilistic and Randomized Methods for Design under Uncertainty, pages 243–261. Springer, 2006.

Bart P. G. Van Parys, Paul J. Goulart, and Daniel Kuhn. Generalized Gauss inequalities via semidefinite programming. Mathematical Programming, 156(1-2):271–302, 2016.

Lieven Vandenberghe, Stephen Boyd, and Katherine Comanor. Generalized Chebyshev Bounds via Semidefinite Programming. SIAM Review, 49(1):52–64, 2007.

Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: Analysis and efficient estimation. In Advances in Neural Information Processing Systems 31, pages 3835–3844. Curran Associates, Inc., 2018.

Zizhuo Wang, Peter W. Glynn, and Yinyu Ye. Likelihood robust optimization for data-driven problems. Computational Management Science, 13(2):241–261, 2016.

Bo Wei, William B. Haskell, and Sixiang Zhao. The CoMirror algorithm with random constraint sampling for convex semi-infinite programming. Annals of Operations Research, 2020.

Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5286–5295. PMLR, 2018.

Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(7), 2009.

Yangyang Xu. Primal-Dual Stochastic Gradient Method for Convex Programs with Many Functional Constraints. SIAM Journal on Optimization, 30(2):1664–1692, 2020.

Chaoyue Zhao and Yongpei Guan. Data-driven risk-averse stochastic optimization with Wasserstein metric. Operations Research Letters, 46(2):262–267, 2018.

Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, and Bernhard Schölkopf. Worst-Case Risk Quantification under Distributional Ambiguity using Kernel Mean Embedding in Moment Problem. arXiv:2004.00166 [cs, eess, math], 2020.

Steve Zymler, Daniel Kuhn, and Berç Rustem. Distributionally robust joint chance constraints with second-order moment information. Mathematical Programming, 137(1-2):167–198, 2013.