Kernel Distributionally Robust Optimization

Jia-Jie Zhu Wittawat Jitkrittum Empirical Inference Department Empirical Inference Department Max Planck Institute for Intelligent Systems Max Planck Institute for Intelligent Systems Tübingen, Germany Tübingen, Germany [email protected] Currently at Google Research, NYC, USA [email protected]

Moritz Diehl Bernhard Schölkopf Department of Microsystems Engineering Empirical Inference Department & Department of Mathematics Max Planck Institute for Intelligent Systems University of Freiburg Tübingen, Germany Freiburg, Germany [email protected] [email protected]

Abstract 1 INTRODUCTION

We propose kernel distributionally robust op- Imagine a hypothetical scenario in the illustrative figure timization (Kernel DRO) using insights from where we want to arrive at a destination while avoiding the robust optimization theory and functional unknown obstacles. A worst-case robust optimization analysis. Our method uses reproducing kernel (RO) (Ben-Tal et al., 2009) approach is then to avoid Hilbert spaces (RKHS) to construct a wide the entire unsafe area (left, blue). Suppose we have range of convex ambiguity sets, which can be historical locations of the obstacles (right, dots). We generalized to sets based on integral probabil- may choose to avoid only the convex polytope that ity metrics and finite-order bounds. contains all the samples (pink). This data-driven robust This perspective unifies multiple existing ro- decision-making idea improves efficiency while retaining bust and stochastic optimization methods. robustness. We prove a theorem that generalizes the clas- sical duality in the mathematical problem of moments. Enabled by this theorem, we re- formulate the maximization with respect to measures in DRO into the dual program that The concept of distributional ambiguity concerns the searches for RKHS functions. Using universal uncertainty of uncertainty — the underlying probability arXiv:2006.06981v3 [math.OC] 27 Feb 2021 RKHSs, the theorem applies to a broad class measure is only partially known or subject to change. of loss functions, lifting common limitations This idea is by no means a new one. The classical such as polynomial losses and knowledge of moment problem concerns itself with estimating the R the Lipschitz constant. We then establish a worst-case risk expressed by maxP ∈K l dP where l is connection between DRO and stochastic op- some loss function. The constraint P ∈ K describes timization with expectation constraints. Fi- the distribution ambiguity, i.e., P is only known to live nally, we propose practical algorithms based within a subset K of probability measures. The solution on both batch convex solvers and stochastic to the moment problem gives the risk under some functional gradient, which apply to general worst-case distribution within K. To make decisions optimization and machine learning tasks. that will minimize this worst-case risk is the idea of distributionally robust optimization (DRO) (Delage and This work appeared in the proceedings of the 24th International Conference on Artificial Intelligence and Ye, 2010; Scarf, 1958). Statistics (AISTATS) 2021, San Diego, California, USA. Many of today’s learning tasks suffer from various man- PMLR: Volume 130. ifestations of distributional ambiguity — e.g., covariate Kernel Distributionally Robust Optimization

∗ shift, adversarial attacks, simulation to reality trans- tion. δC(f) := supµ∈Chf, µiH is the function of fer — phenomena that are caused by the discrepancy C. SN denotes the N-dimensional simplex. ri(·) denotes between training and test distributions. Kernel meth- the relative interior of a set. A function f is upper semi-

ods are known to possess robustness properties, e.g., continuous on X if lim supx→x0 f(x) ≤ f(x0), ∀x0 ∈ X ; (Christmann and Steinwart, 2007; Xu et al., 2009). it is proper if it is not identically −∞. When there is no However, this robustness only applies to kernelized ambiguity, we simplify the loss function notation l(θ, ·) models. This paper extends the robustness of kernel by using l to indicate that results hold for θ point-wise. methods using the robust counterpart formulation tech- niques (Ben-Tal et al., 2009) as well as the principled 2.1 Robust and distributionally robust conic duality theory (Shapiro, 2001). We term our optimization approach kernel distributionally robust optimization (Kernel DRO), which can robustify general optimiza- Robust optimization (RO) (Ben-Tal et al., 2009) stud- tion solutions not limited to kernelized models. ies mathematical decision-making under uncertainty. It solves the min-max problem (omitting constraints) The main contributions of this paper are: minθ supξ∈X l(θ, ξ), where l(θ, ξ) denotes a general loss 1. We rigorously prove the generalized duality theo- function, θ is the decision variable, and ξ is a variable rem (Theorem 3.1) that reformulates general DRO representing the uncertainty. Intuitively, RO makes into a convex dual problem searching for RKHS the decision assuming an adversarial scenario, as re- functions, lifting common limitations of DRO on flected in taking supremum w.r.t. ξ. For this reason, the loss functions, such as the knowledge of Lip- it is often referred to as the worst-case RO. Recently, schitz constant. The theorem also constitutes a RO has been applied to the setting of adversarially generalization of the duality results from the liter- robust learning, e.g., in (Madry et al., 2019; Wong and ature of mathematical problem of moments. Kolter, 2018), which we will visit in this paper. In the 2. We use RKHSs to construct a wide range of convex optimization literature, a typical approach to solving ambiguity sets (in Table 1, 3), including sets based RO is via reformulating the min-max program using on integral probability metrics (IPM) and finite- duality to obtain a single minimization problem. In order moment bounds. This perspective unifies contrast, distributionally robust optimization (DRO) existing RO and DRO methods. minimizes the expected loss assuming the worst-case 3. We propose computational algorithms based on distribution: both convex solvers and stochastic approximation,  Z  which can be applied to robustify general optimiza- min sup l(θ, ξ) dP (ξ) , (1) θ P ∈K tion and machine learning models not limited to kernelized or known-Lipschitz-constant ones. where K ⊆ P, called the ambiguity set, is a subset 4. Finally, we establish an explicit connection be- of distributions, e.g., all distributions with the given tween DRO and stochastic optimization with mean and . Compared with RO, DRO only expectation constraints. This leads to a novel robustifies the solution against a subset K of distribu- stochastic functional gradient DRO (SFG-DRO) tions on X and is, therefore, less conservative (since R algorithm which can scale up to modern machine supP ∈K{ l(θ, ξ) dP (ξ)} ≤ supξ∈X l(θ, ξ)). learning tasks. The inner problem of (1), historically known as the In addition, we give complete self-contained proofs in problem of moments traced back at least to Thomas the appendix that shed light on the connection be- Joannes Stieltjes, estimates the worst-case risk under tween RKHSs, conic duality, and DRO. We also show uncertainty in distributions. The modern approaches, that universal RKHSs are large enough for DRO from pioneered by the work of (Isii, 1962) (see also (Lasserre, the perspective of functional analysis through concrete 2002; Shapiro, 2001; Bertsimas and Popescu, 2005; examples. Popescu, 2005; Vandenberghe et al., 2007; Van Parys et al., 2016; Zhu et al., 2020)), typically seek a sharp 2 BACKGROUND upper bound via duality. This duality, rigorously jus- tified in (Shapiro, 2001), is different from that in the Euclidean space because infinite-dimensional convex Notation. X ⊂ Rd denotes the input domain, which sets can become pathological. Using that methodol- is assumed to be compact unless otherwise specified. ogy, we can reformulate DRO (1) into a single solvable P := P(X ) denotes the set of all Borel probability minimization problem. measures on X . We use Pˆ to denote the empirical dis- ˆ PN 1 tribution P = i=1 N δξi , where δ is a Dirac measure Existing DRO approaches can be grouped into three N and {ξi}i=1 are data samples. We refer to the function main categories by the type of ambiguity sets used. δC(x) := 0 if x ∈ C, ∞ if x∈ / C, as the indicator func- DRO with (finite-order) moment constraints has been Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, Bernhard Schölkopf

studied in (Delage and Ye, 2010; Scarf, 1958; Zymler 2012). With a universal H, given two distributions P,Q, et al., 2013). The authors of (Ben-Tal et al., 2013; kµP − µQkH defines a metric. This quantity is known Iyengar, 2005; Nilim and El Ghaoui, 2005; Wang et al., as the maximum mean discrepancy (MMD) (Gretton p 2016; Duchi et al., 2016) studied DRO using likelihood et al., 2012). With kfkH := hf, fiH and the repro- 2 bounds as well as φ−divergence. Wasserstein-distance- ducing property, it can be shown that kµP − µQkH = 0 0 based DRO has been studied by the authors of (Mo- Ex,x0∼P k(x, x ) + Ey,y0∼Qk(y, y ) − 2Ex∼P,y∼Qk(x, y), hajerin Esfahani and Kuhn, 2018; Zhao and Guan, allowing the plug-in estimator to be used for esti- 2018; Gao and Kleywegt, 2016; Blanchet et al., 2019), mating the MMD from empirical data. The MMD and applied in a large body of literature. Many exist- is an instance of the class of integral probability ing approaches require either the assumptions such as metrics (IPMs), and can equivalently be written as R quadratic loss functions or the knowledge of Lipschitz kµP − µQkH = supkfkH≤1 f d(P − Q), where the op- constant or RKHS norm of the loss l, which are often timum f ∗ is a witness function (Gretton et al., 2012; hard to obtain in practice; see (Virmaux and Scaman, Sriperumbudur et al., 2012). 2018; Bietti et al., 2019). 3 THEORY 2.2 Reproducing kernel Hilbert spaces We make the following assumption for the proof. A symmetric function k : X ×X → R is called a positive Pn Pn definite kernel if i=1 i=1 aiajk(xi, xj) ≥ 0 for any Assumption 3.1. l(θ, ·) is proper, upper semicontin- n n n ∈ N, {xi}i=1 ⊂ X , and {ai}i=1 ⊂ R. Given a uous. C is closed convex. ri(KC) 6= ∅. k positive definite kernel , there exists a Hilbert space This assumption is general in that it does not require H φ: X → H k(x, y) = and a feature map , for which the knowledge of the Lipschitz constant or the RKHS hφ(x), φ(y)i H H H defines an inner product on , where l(θ, ·) lives in. Generally speaking, the DRO prob- X H is a space of real-valued functions on . The space lem (1) requires two essential elements: an appropriate is called a reproducing kernel Hilbert space (RKHS). ambiguity set that contains meaningful distributions reproducing property f(x) = It is equipped with the : and a sharp reformulation of the min-max problem. hf, φ(x)i f ∈ H, x ∈ X H for any . By convention, we We first present the former in Section 3.1, and then the φ(x) := k(x, ·) will denote the canonical feature map as . latter in Section 3.2. Complete proofs of our theory H Properties of the functions in are inherited from the are deferred to the appendix. properties of k. For instance, if k is continuous, then any f ∈ H is continuous. A continuous kernel k on 3.1 Generalized primal formulation a compact metric space X is said to be universal if H is dense in C(X ) (Steinwart and Christmann, 2008, We now present the primal formulation of kernel dis- Section 4.5). A universal H can thus be considered tributionally robust optimization (Kernel DRO) as a a large RKHS since any continuous function can be generalization of existing DRO frameworks. approximated arbitrarily well by a function in H. An example of a universal kernel is the Gaussian kernel Z  2  kx−yk2 k(x, y) = exp − 2 defined on X where σ > 0 is (P ) := min sup l(θ, ξ) dP (ξ): 2σ θ P,µ the bandwidth parameter. Z  φ dP = µ, P ∈ P, µ ∈ C , RKHSs first gained widespread attention following the (2) advent of the kernelized support vector machine (SVM) for classification problems (Cortes and Vapnik, 1995; where H is an RKHS whose feature map is φ. Both R Boser et al., 1992; Schölkopf et al., 2000). More recently, sides of the constraint φ dP = µ are functions in the use of RKHSs has been extended to manipulat- H. Note µ can be viewed as a generalized moment ing and comparing probability distributions via kernel vector, which is constrained to lie within the set C ⊆ mean embedding (Smola et al., 2007). Given a distri- H, referred to as an (RKHS) ambiguity set. Let us bution P , and a (positive definite) kernel k, the kernel denote the set of all feasible distributions in (2) as R R mean embedding of P is defined as µP := k(x, ·) dP . KC = {P : φ dP = µ, µ ∈ C,P ∈ P}, i.e., KC is the If Ex∼P [k(x, x)] < ∞, then µP ∈ H (Smola et al., usual ambiguity set. Intuitively, the set C restricts the 2007, Section 1.2). The reproducing property allows RKHS embeddings of distributions in the ambiguity one to easily compute the expectation of any function set KC. In this paper, we take a geometric perspective f ∈ H since Ex∼P [f(x)] = hf, µP iH. Embedding distri- to construct C using convex sets in H. Given data N butions into H also allows one to measure the distance samples {ξi}i=1, we outline various choices for C in the between distributions in H. If k is universal, then the left column of Table 1 (and 3 in the appendix), and mean map P 7→ µP is injective on P (Gretton et al., illustrate our intuition in Figure 1. Kernel Distributionally Robust Optimization

Table 1: Examples of support functions for Kernel DRO. See Table 3 for more details.

∗ RKHS ambiguity set C Support function δC(f) 1 PN RKHS norm-ball C = {µ: kµ − µPˆ kH ≤ } N i=1 f(ξi) + kfkH Polytope C = conv{φ(ξ1), . . . , φ(ξN )} maxi f(ξi) (scenario optimization; SVMs with no slack) C = PN C PN δ∗ (f) Minkowski sum i=1 i i=1 Ci Whole space C = H 0 if f = 0, ∞ otherwise (worst-case RO (Ben-Tal et al., 2009))

moment. Therefore, any C that contains µPˆ also con-

tains all the distributions in {µQv , v 6= 0}.

This example shows that small RKHSs force Kernel DRO to robustify against a large set of distributions, resulting in conservativeness. In the extreme, if we (a) RKHS ambiguity sets C choose the smallest possible RKHS H = {0}, then the space does not contain functions to separate any distinct distributions. This renders Kernel DRO (4) overly conservative since we can only choose f = 0 in (4) — it is precisely reduced to worst-case RO. On the (b) Interpretation of Kernel DRO other extreme, the next example shows the downside of function spaces that are too large. (a) Figure 1: : Geometric intuition for choosing am- Example 3.3 (DRO with large function space). Sup- C H biguity set in such as norm-ball, polytope, and pose H is the space of all bounded measurable functions, Minkowski sum of sets. The scattered points are the the metric induced by H becomes the total variation embeddings of empirical samples. See Table 3 for more distance (Sriperumbudur et al., 2011). While the in- (b) examples. : Geometric interpretation of Kernel duced topology is strong, (4) has a trivial solution DRO (4). The (red) curve depicts f0 + f, which ma- f = l, f0 = 0. By plugging this solution into (4), we jorizes l(θ, ·) ξ (black). The horizontal axis is . The recover (2). Hence the reformulation becomes mean- X dashed lines denote the boundary of the domain . ingless.

We distinguish between DRO without metrics, e.g., mo- To better understand our unifying formulation, let us ment constraints and SVMs, and DRO with probability examine the celebrated SVM through the lens of our metric or divergence, e.g., Wasserstein metric. Let us generalized formulation. first examine an instance of the former using Kernel Example 3.1 (SVM as generalized DRO). Let us con- DRO. We return to the latter at the end of this section. sider SVM for regression, without using slack variables Example 3.4 (Reduction to DRO with moment con- or regularization for simplicity. This can be formulated straints). Kernel DRO with the second-order polyno- > 2 as optimizing the loss minf∈H maxi[|yi − f(xi)| − η]+ mial kernel k2(x, y) := (1 + x y) and a singleton η > 0 where is the parameter for the hinge loss. This ambiguity set C = {µPˆ } robustifies against all distri- can be seen as the generalized DRO butions sharing the first two moments with Pˆ. This Z is equivalent to DRO with known first two moments, min sup [|y − f(x)| − η]+dP (x, y), such as in (Delage and Ye, 2010; Scarf, 1958). More f∈H P ∈K generally, the choice of the pth-order polynomial kernel > p kp(x, y) := (1+x y) corresponds to DRO with known where the ambiguity set is given by the polytope K = first p moments. clconv{δξ1 , . . . , δξN }, ξi = (xi, yi). If H is associated with a universal kernel (e.g., Gaus- Let us now consider a small RKHS to understand the sian), it is large since H is dense in the space of con- effect of different RKHSs on Kernel DRO. tinuous functions (cf. (Sriperumbudur et al., 2011)). Example 3.2 (DRO with non-universal kernels). Con- Then the induced topology (MMD) is strong enough ˆ 2 sider distributions P = N (0, 1),Qv = N (0, v ) and to separate distinct probability measures. Meanwhile, H1 induced by the linear kernel k1(x, y) := xy. RKHS allows for efficient computation using tools from H1 is small since it only contains linear functions. kernel methods, as shown in Section 4. Therefore, our ˆ MMDk1 (P,Qv) = 0, ∀v 6= 0 since they share the first insight is that universal RKHSs are large enough for Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, Bernhard Schölkopf

DRO applications. Using the reproducing property of RKHSs and conic Remark. For the RKHS associated with the Gaus- duality, we arrive at (4) with weak duality. Our strong sian kernel, the diameter of the space can be com- duality proof is an extension of the conic strong du- ality in Eulidean spaces. We rely on the existance of puted: ∀p, q, kµp − µqkH ≤ kµpkH + kµqkH ≤ p separating hyperplnes between convex sets in locally 2 sup k(x, y) = 2. Hence, if  ≥ 2, C contains x,y convex function spaces, e.g., H. See the illustration all probability distributions. Then, Kernel DRO is in Figure 2. In our generalized duality theorem, this again reduced to worst-case RO on domain X . separating hyperplane is determined by the witness ∗ We now turn to DRO with a generalized class of integral function f , which is the optimal dual variable in (4). probability metrics (IPM). See the appendix for the full proof. Example 3.5 (Generalization to IPM-DRO). Suppose ˆ R ˆ dF (P, P ) := supf∈F fd(P − P ) is the IPM defined by some function class F. The IPM-DRO primal for- mulation is given by

Z min sup l(θ, ξ) dP (ξ). (3) θ Figure 2: Illustration of a separating hyperplane in H dF (P,Pˆ)≤ Theorem 3.1 generalizes the classical bounds in gen- If we choose the class F = {f : kfkH ≤ 1}, we recover eralized moment problems (Isii, 1962; Lasserre, 2002; Kernel DRO with the RKHS norm-ball set in Table 1. Shapiro, 2001; Bertsimas and Popescu, 2005; Popescu, Similarly, F = {f : lip(f) ≤ 1} recovers the (type- 2005; Vandenberghe et al., 2007; Van Parys et al., 2016) 1) Wasserstein-DRO. This puts Wasserstein-DRO and to infinitely many moments using RKHSs. A distinc- Kernel DRO into a unified perspective. tion between Theorem 3.1 and other DRO approaches is that it uses the density of universal RKHSs to find 3.2 Generalized duality theorem a surrogate which can sharply bound the worst-case risk. This means that we do not require the loss l(θ, ·) We now present the main theorem of this paper, the to be affine, quadratic, or living in a known RKHS, generalized duality theorem of Kernel DRO (2). nor do we require the knowledge of Lipschitz constant l(θ, ·) Theorem 3.1 (Generalized Duality). Under As- or RKHS norm of . To our knowledge, existing sumption 3.1, (2) is equivalent to works typically require one of such assumptions. Moreover, Theorem 3.1 generalizes existing RO and (D) := min f + δ∗(f) 0 C DRO in the sense that it gives us a unifying tool to θ,f0∈R,f∈H (4) work with various ambiguity and ambiguity sets, which subject to l(θ, ξ) ≤ f0 + f(ξ), ∀ξ ∈ X may be customized for specific applications. We outline ∗ where δ (f) := sup hf, µiH is the support function a few closed-form expressions of the support function C µ∈C δ∗(f) of C, i.e., (P ) = (D), strong duality holds for the inner C in Table 1, and more in Table 3. We now return moment problem for any θ point-wise. IPM-DRO with a duality result. Corollary 3.1.1 (IPM-DRO duality). Given the inte- ˆ R The theorem holds regardless of the dependency of gral probability metric dF (P, P ) := supf∈F fd(P − l θ l θ on , e.g., non-convexity. If is convex in , then Pˆ), a dual program to (3) is given by (4) is a convex program. Formulation (4) has a clear geometric interpretation: we find a function f0 +f that N 1 X majorizes l(θ, ·) and subsequently minimize a surrogate min f0 + λf(ξi) + λ θ,λ≥0,f0∈R,f∈F N loss involving f0 and f. This is illustrated in Figure 1b. i=1 (5) Note the term duality here refers to the inner moment subject to l(θ, ξ) ≤ f0 + λf(ξ), ∀ξ ∈ X . problem. The statement can be further simplified by replacing f0 +f with f. However, we choose the current The reduction to (4) as a special case can be seen by notation for the sake of its explicit connection to RO. replacing λf with f and choosing F = H. Proof sketch. Our weak duality proof follows stan- We now establish an explicit connection between DRO dard paradigms of Lagrangian relaxation by introduing and stochastic optimization with expectation con- dual variables. Notably, we associate the functional straint, whose solution methods using stochastic ap- R constraint φ dP = µ with a dual function f ∈ H, proximation are an topic of active research (Lan and which is the decision variable in the dual problem (4). Zhou, 2020; Xu, 2020). Kernel Distributionally Robust Optimization

Corollary 3.1.2. (Stochastic optimization with ex- sample, i.e., pectation constraint) Under the Assumption 3.1, the min f0 + f(0) + kfkH optimal value of (2) coincides with that of (d) := θ,f∈H,f0∈R ∗ subject to subject to [|θ| − 1]+ ≤ f0 + f(0) min f0 + δC(f) θ,f0∈R,f∈H ∗ ∗ ∗ (6) which admits an optimal solution θ = 0, f = 0, f0 = 0 subject to h (l (θ, ζ) − f0 − λf (ζ)) ≤ 0 E and the worst-case risk (d) = 0. However, let µP 0 = 1 φ(0) + 1 φ(2) P 0 ∈ for some function h that satisfies h(t) = 0 if t ≤ 2 2 . It is straightforward to verify C, R l(θ∗, ξ) dP 0(ξ) = 1 > (d) θ∗ 0, h(t) > 0 if t > 0, and ζ ∼ µ whose 2 , i.e., the solution is 0 probability measure places positive mass on any non- not robust against P . empty open subset of X , i.e., µ(B) > 0, ∀B ⊆ X ,B =6 ∅,B is open. 4 COMPUTATION

A choice for h is h(·) = [·]+, which is used in the con- Given a certain parametrization of the RKHS function ditional value-at-risk (Rockafellar and Uryasev, 2000). f, (4) is a semi-infinite program (SIP) (Guerra Vázquez We will see the computational implication of Corol- et al., 2008). In the following, we propose two computa- lary 3.1.2 in Section 4. tional methods that do not require a polynomial loss l or We now establish further theoretical results as a con- the knowledge of its Lipschitz constant. For simplicity, sequence of the generalized duality theorem to help we only derive the formulations for the RKHS-norm- us understand the geometric intuition of how Kernel ball ambiguity set, while other formulations are given DRO works. By the weak duality (P ) ≤ (D) of (11) in Table 1, 3. R ∗ and (12), we have l dP ≤ f0 + δ (f). Specifically, if C A batch approach by discretization of SIP. We C is the RKHS norm-ball in Table 1 , this inequal- R 1 PN first consider an approach based on the discretization ity becomes l dP ≤ f0 + f(ξi) + kfkH. Its N i=1 method of SIP (Guerra Vázquez et al., 2008). Let right-hand-side can be seen as a computable bound for us consider an ambiguity set smaller than the RKHS- the worst-case risk when generalizing to P . This may M norm ball of distributions supported on some {ζj} ⊆ be useful when the Lipschitz constant of l is not known j=1 X . Then it suffices to consider the following program, or hard to obtain, as is often the case in practice. The which relaxes the constraint of (4) to finite support. following insight is a consequence of a generalization complementarity condition N of the classical of convex 1 X optimization; see the appendix. min f0 + f(ξi) + kfkH θ,f∈H,f0∈R N (7) Corollary 3.1.3 (Interpolation property). Given θ, i=1 ∗ ∗ ∗ subject to l(θ, ζi) ≤ f(ζj) + f0, j = 1 ...M. let P , f , f0 be a set of optimal primal-dual solutions ∗ ∗ associated with (P) and (D), then l(θ, ξ) = f0 + f (ξ) Note (7) is a convex program if l(θ, ξ) in convex in θ. holds P ∗-almost everywhere. We can parametrize the RKHS function f by a wealth of ∗ ∗ Intuitively, this result states that f0 + f interpolates tools from kernel methods, such as the random features the loss l(θ, ·) at the support points of P ∗. This is illus- fˆ(ξ) = w>φˆ(ξ) for large scale problems (Rahimi and trated in Figure 1 (b) and later empirically validated Recht, 2008). Alternatively, for small problems, we can in Figure 3. We can also see that the size of RKHS H parametrize f by a kernel expansion on the the points ∗ ∗ matters since, if H is small (e.g., H = {0}), f0 + f ζi. We provide concrete plug-in forms in the appendix. cannot interpolate the loss l well. On the other hand, As an interesting by-product of (7), let us derive an the density of universal RKHS allows the interpolation unconstrained version of (7), which gives rise to a gen- of general loss functions. eralized risk measure that we term kernel conditional It is tempting to approximately solve (4) by relaxing value-at-risk (Kernel CVaR). the constraint to hold for only the empirical samples, Example 4.1 (Kernel CVaR). i.e., l(θ, ξi) ≤ f0 + f(ξi), i = 1 ...N. The following ob- M servation cautions us against this. 1 X K-CVaRα(X) := inf [X−f(ζj)−f0]+ Example 3.6 (Counterexample: relaxation of the f∈H,f0∈ αM R j=1 semi-infinite constraint). Let H be a Gaussian √ N RKHS with the bandwidth σ = 2. Suppose 1 X + f0 + f(ξi) + kfkH. (8) our data set is {0} and the ambiguity set is N p i=1 C := {µ: kµ − φ(0)kH ≤ }. Let  = 2 − 2/e. We M N consider the loss function l(ξ) = [|θ + ξ| − 1]+ and re- If f = 0 and {ζj}j=1 = {ξi}i=1, then (8) is reduced to laxing the constraint of (4) to only hold at the empirical the classical CVaR (Rockafellar and Uryasev, 2000). Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, Bernhard Schölkopf

Program (7) can be readily solved using off-the-shelf used in Step 5 of the algorithm. It is worth noting that, convex solvers. However, to scale up to large data when used with a primal SA approach such as (Lan sets, we next develop a stochastic approximation (SA) and Zhou, 2020), SFG-DRO completely operates in the method for Kernel DRO. dual space (an RKHS) since Kernel DRO (4) is based on the generalized duality Theorem 3.1. This interplay Stochastic functional gradient DRO. We now between the primal (measures) and dual (functions) is present our SA approach enabled by Theorem 3.1 by em- the essence of our theory. ploying two key tools: 1) scalable approximate RKHS features, such as random Fourier features (Rahimi and 5 NUMERICAL STUDIES Recht, 2008; Dai et al., 2014; Carratino et al., 2018), and 2) stochastic approximation with semi-infinite and expectation constraints (Tadić et al., 2006; Lan and This section showcases the applicability of Kernel DRO Zhou, 2020; Baes et al., 2011; Xu, 2020). (and hence SFG-DRO) and discusses the robustness- optimality trade-off. Our purpose is not to bench- Let us summon Corollary 3.1.2 to formulate a stochastic mark state-of-art performances or to demonstrate program with expectation constraint. the superiority of a specific algorithm. Indeed, we

N believe both RO and DRO are elegant theoretical 1 X frameworks that have their specific use cases. We min f0 + f(ξi) + kfkH θ,f0∈ ,f∈H N R i=1 (9) note that our theory can be applied to a broader scope of applications than the examples here, such subject to [l (θ, ζ) − f − λf (ζ)] ≤ 0 E 0 + as stochastic optimal control. See the appendix for where ζ follows a certain proposal distribution on X , more experimental results. The code is available at e.g., uniform or by adaptive sampling. An alternative https://github.com/jj-zhu/kdro. is to directly solve (4) using SA techniques with SI constraints, such as (Tadić et al., 2006; Wei et al., Distributionally robust solution to uncertain 2020). (4), (6), and (9) are all convex in function f. least squares. We first consider a robust least We can compute the functional gradient by squares problem adapted from (El Ghaoui and Le- bret, 1997), which demonstrated an important appli- f ∇f f = φ, ∇f kfkH = . (10) cation of RO to statistical learning historically. (See kfkH also (Boyd et al., 2004, Ch. 6.4).) The task is to 2 minimize the objective kAθ − bk2 w.r.t. θ. A is mod- When used with approximate features of the form eled by A(ξ) = A0 + ξA1, where ξ ∈ X is uncertain, ˆ > ˆ ˆ 10×10 10 f(ξ) = w φ(ξ), we further have ∇wf(ξ) = X = [−1, 1], and A0,A1 ∈ R , b ∈ R are given. ˆ ˆ φ(ξ), ∇wkfkH = w/kwk2. We outline our stochastic We compare Kernel DRO against using (a) empirical functional gradient DRO (SFG-DRO) in Algorithm 1. risk minimization (ERM; also known as sample average 1 PN 2 approximation) that minimizes N i=1 kA(ξi) θ − bk2, Algorithm 1 Stochastic Functional Gradient DRO (b) worst-case RO via SDP from (El Ghaoui and (SFG-DRO) Lebret, 1997). We consider a data-driven set- N 1: for k = 1, 2,... do ting with given samples {ξi}i=1 with the Kernel k min max kA(ξ) θ − 2: Sample mini-batch data {ξi } := {xi, yi}. Sam- DRO formulation θ P ∈P,µ∈C Eξ∼P 2 R ple {ζi} from some proposing distribution. bk2 subject to φdP = µ, where we choose the ambi- 3: Approximate f, e.g., by random Fourier feature guity set to be the -norm-ball in the RKHS (Table 1). ˆ k > ˆ k f(ξi ) = w φ(ξi ). N Empirical samples {ξi}i=1(N = 10) are generated 4: Estimate the stochastic functional gradient of uniformly from [−0.5, 0.5]. We then apply Kernel the objective and constraint in (9) using (10). DRO formulation (7). To test the solution, we cre- 5: θ, f , f Update 0 using the functional gradient with ate a distribution shift by generating test samples from any SA routine with expectation or semi-infinite [−0.5 · (1 + ∆), 0.5 · (1 + ∆)], where ∆ is a perturbation constraints, e.g., (Lan and Zhou, 2020; Xu, 2020; varying within [0, 4]. Figure 3a shows this comparison. Tadić et al., 2006; Wei et al., 2020) . As the perturbation increases, ERM quickly lost ro- bustness. On the other hand, RO is the most robust with the trade-off of being conservative. As expected, Compared with many batch-setting DRO approaches, Kernel DRO achieves some level of optimality while SFG-DRO can be used with general model classes, such retaining robustness. as neural networks, and is applicable to a broad class of optimization and modern learning tasks. The conver- We then ran Kernel DRO with fewer empirical samples gence guarantee follows that of the specific SA routine (N = 5) to show the geometric interpretations. We plot Kernel Distributionally Robust Optimization

2.0 1.0 3 1.0 0.20 1.0

0.8 1.5 0.8 0.15 0.8 2 K-DRO = 0.5 loss 0.6 1.0 0.6 0.10 0.6 ERM

K-DRO loss f0 + f * PGD

test loss = 0.1 1 P test error 0.4 RO 0.4 0.05 0.4 SFG-DRO 0.5 P ERM 0 0.2 0.2 0.00 0.2 0 1 2 3 4 0.0 0.0 0.2 0.4 1 0 1 perturbatio0.0n test perturba0.0tion 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.00 0.25 0.50 0.75 1.00 0.0 0.2 0.4 0.6 0.8 1.0 (a) Uncertain least squares loss (b) Geometric interpretation (c) MNIST classification error

1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1

1 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 0 0

(d) (Left) unperturbed data (Center) ERM classification result (red indicates errors) (Right) SFG-DRO (Kernel DRO)

Figure 3: Uncertain least squares. (a) This plot depicts the test loss of algorithms. All error bars are in standard error. We ran 10 independent trials. In each trial, we solved Kernel DRO to obtain θ ∗ and tested it on a test dataset of 500 samples. We then vary the perturbation ∆ from 0 to 4. (b) (red) is the dual optimal ∗ ∗ ∗ solution f0 + f . (black) is the function l(θ , ·). The pink bars depict a worst-case distribution while the blue ∗ ∗ ∗ bars the empirical distribution. We can observe that f0 + f touches loss l(θ , ·) at the support of the worst-case distribution P ∗ (pink dots). Note f ∗ (normalized) can be viewed as a witness function of the two distributions. Classification under perturbation (c) We plot the classification error rate during test time. The x−axis is the perturbation magnitude allowed on the test data. For ERM, PGD, and SFG-DRO (Kernel DRO), we train 5 independent models. Each model is tested on 500 randomly sampled images. (d) We visualize the predictions of ERM and SFG-DRO on the perturbed images with perturbation magnitude ∆ = 0.2. Blue frames indicate correct predictions while the red ones indicate errors.

∗ ∗ the optimal dual solution f0 +f in Figure 3b. Recall it (cf. (Madry et al., 2019; Madry)). Note the overall loss is an over-estimator of the loss l(θ, ·). We solve the inner of PGD is an average loss instead of a worst-case one. moment problem (see appendix) to obtain a worst-case Hence it is already less conservative than RO. We train ∗ ∗ ˆ distribution P . Comparing P with P , we can observe a classification model gθ : x 7→ y using SFG-DRO in the adversarial behavior of the worst-case distribution. Algorithm 1, with the SA subroutine of (Lan and Zhou, See the caption for more description. From Figure 3b, 2020). During training, we set the ambiguity size of we can see that the intuition of Kernel DRO is to SFG-DRO as  = 0.5 and domain X to be norm-balls flatten the loss curve using a smooth function. around the training data X = {ζ = X+δ : kδk∞ ≤ 0.5} where X is the training data. Distributionally robust learning under adver- Figure 3d (left) plots unperturbed test samples. Fig- sarial perturbation. We now demonstrate the ure 3c shows the classification error rate as we increase framework of SFG-DRO in Algorithm 1 in a non-convex the magnitude of the perturbation ∆. We observe that setting. For simplicity, we consider a MNIST binary ERM attains good performance when there is no test- classification task with a two-layer neural network. We time perturbation but quickly underperforms as the emphasize that the deliberate choice of this simple ar- noise level increases. PGD is the most robust under chitecture ablates factors known to implicitly influence large perturbation, but has the worst nominal perfor- robustness, such as regularization and dropout. The mance. SFG-DRO possesses improved robustness while data set contains images of zero and one (i.e., two 28×28 its performance under no perturbation does not become classes). Each image x is represented by x ∈ [0, 1] . much worse. This is consistent with our theoretical The test data is perturbed by an unknown disturbance, insights into RO and DRO. i.e., x˜test := x + δ where x ∼ Ptest is the unperturbed test data and δ is the perturbation. In the plots, δ is generated by the PGD algorithm (Madry et al., 2019) using projected gradient descent to find the worst-case perturbation within a box {δ : kδk∞ ≤ ∆}}. We com- pared SFG-DRO (Kernel DRO) with ERM and PGD Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, Bernhard Schölkopf

6 OTHER RELATED WORK AND der the Marie Skłodowska-Curie grant agreement No DISCUSSION 798321. Moritz Diehl would like to acknowledge the funding support from the DFG via Project DI 905/3-1 This paper uses similar techniques of reformulating (on robust model predictive control) min-max programs as in (Ben-Tal et al., 2015; Bertsi- mas et al., 2017), but our ambiguity set is constructed References in an RKHS. Duchi et al. (2020) proposed variational approximations to marginal DRO to treat covariate Michel Baes, Michael Bürgisser, and Arkadi Ne- shift in supervised learning. The authors of (Zhu et al., mirovski. A randomized Mirror-Prox method for 2020) used kernel mean embedding for the inner mo- solving structured large-scale matrix saddle-point arXiv:1112.1274 [math] ment problem (but not DRO) and proved the statis- problems. , December 2011. tical consistency of the solution. The work of Staib Alexander Barvinok. A Course in Convexity, volume 54. and Jegelka (2019) used insights from DRO to moti- American Mathematical Soc., 2002. vate a regularizer for kernel ridge regression. DRO has Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Ne- been also applied to Bayesian optimization in (Rontsis mirovski. Robust Optimization, volume 28. Princeton et al., 2020; Kirschner et al., 2020), where the latter University Press, 2009. work used MMD ambiguity sets of distributions over discrete spaces. In terms of scalability, recent works Aharon Ben-Tal, Dick den Hertog, Anja De Waege- such as (Sinha et al., 2020; Li et al., 2019; Namkoong naere, Bertrand Melenberg, and Gijs Rennen. Ro- and Duchi, 2016) also explored DRO for modern ma- bust Solutions of Optimization Problems Affected by Management Science chine learning tasks. To the best of our knowledge, no Uncertain Probabilities. , 59(2): existing work contains the results such as generalized 341–357, February 2013. ISSN 0025-1909, 1526-5501. ambiguity set constructions in Table 1, 3, generalized doi: 10.1287/mnsc.1120.1641. duality theory underpinned by Theorem 3.1, or the Aharon Ben-Tal, Dick den Hertog, and Jean-Philippe stochastic functional gradient algorithm SFG-DRO. Vial. Deriving robust counterparts of nonlinear un- certain inequalities. Mathematical Programming, 149 In summary , this paper proves Theorem 3.1 that gen- (1):265–299, February 2015. ISSN 1436-4646. doi: eralizes the classical duality theory in the literature of 10.1007/s10107-014-0750-8. mathematical problem of moments and DRO. Using the density of universal RKHSs, the dual bound in The- Dimitris Bertsimas and Ioana Popescu. Optimal In- orem 3.1 is sharp while lifting restrictions on the loss equalities in : A Convex Opti- function class. The generalized primal formulations mization Approach. SIAM Journal on Optimization, shed light on the connection between Kernel DRO and 15(3):780–804, January 2005. ISSN 1052-6234, 1095- existing robust and stochastic optimization approaches. 7189. doi: 10.1137/S1052623401399903. Finally, the proposed stochastic approximation algo- Dimitris Bertsimas, Nathan Kallus, and Vishal Gupta. rithm SFG-DRO enables the applications of Kernel Data-Driven Robust Optimization. Springer Berlin DRO to modern learning tasks. Heidelberg, 2017. ISBN 1010701711258. doi: 10. 1007/s10107-017-1125-8. The compactness assumption on X can be further ex- tended, as universality can be extended to non-compact Alberto Bietti, Grégoire Mialon, Dexiong Chen, and domains (Sriperumbudur et al., 2011). In the special Julien Mairal. A Kernel Perspective for Regularizing case of RKHS-norm-ball ambiguity sets, choosing the Deep Neural Networks. arXiv:1810.00363 [cs, stat], size  can be motivated using kernel statistical test- May 2019. ing (Gretton et al., 2012). However, when DRO is used Jose Blanchet, Yang Kang, and Karthyek Murthy. Ro- in the setting where test distributions are perturbed as bust Wasserstein Profile Inference and Applications in our examples, existing statistical guarantees in the to Machine Learning. Journal of Applied Probability, literature for unperturbed settings cannot be directly 56(03):830–857, September 2019. ISSN 0021-9002, applied. This is a topic of future work. 1475-6072. doi: 10.1017/jpr.2019.49. Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Acknowledgements Vapnik. A training algorithm for optimal margin clas- We thank Daniel Kuhn for the helpful discussion during sifiers. In Proceedings of the Fifth Annual Workshop a workshop at IPAM, UCLA. We also thank Yassine on Computational Learning Theory, pages 144–152, Nemmour and Simon Buchholz for sending us their feed- 1992. back on the paper draft. During part of this project, Stephen Boyd, Stephen P. Boyd, and Lieven Vanden- Jia-Jie Zhu was supported by the European Union’s berghe. Convex Optimization. Cambridge University Horizon 2020 research and innovation programme un- Press, March 2004. ISBN 978-0-521-83378-3. Kernel Distributionally Robust Optimization

G.C. Calafiore and M.C. Campi. The scenario approach A tutorial. Journal of Computational and Applied to robust control design. IEEE Transactions on Mathematics, 217(2):394–419, August 2008. ISSN Automatic Control, 51(5):742–753, May 2006. ISSN 0377-0427. doi: 10.1016/j.cam.2007.02.012. 2334-3303. doi: 10.1109/TAC.2006.875041. Keiiti Isii. On sharpness of tchebycheff-type inequalities. Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Annals of the Institute of Statistical Mathematics, 14 Learning with SGD and Random Features. In S. Ben- (1):185–197, December 1962. ISSN 1572-9052. doi: gio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa- 10.1007/BF02868641. Advances in Neu- Bianchi, and R. Garnett, editors, Garud N. Iyengar. Robust Dynamic Programming. ral Information Processing Systems 31 , pages 10192– Mathematics of Operations Research, 30(2):257–280, 10203. Curran Associates, Inc., 2018. 2005. ISSN 0364-765X. Andreas Christmann and Ingo Steinwart. Consistency Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka, and robustness of kernel-based regression in convex and Andreas Krause. Distributionally Robust Bernoulli risk minimization. , 13(3):799–819, August Bayesian Optimization. arXiv:2002.09038 [cs, stat], 2007. ISSN 1350-7265. doi: 10.3150/07-BEJ5102. March 2020. John B Conway. A course in functional analysis, vol- Guanghui Lan and Zhiqiang Zhou. Algorithms for ume 96. Springer, 2019. stochastic optimization with function or expectation Corinna Cortes and Vladimir Vapnik. Support-vector constraints. Computational Optimization and Appli- networks. Machine learning, 20(3):273–297, 1995. cations, February 2020. ISSN 0926-6003, 1573-2894. Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, doi: 10.1007/s10589-020-00179-x. Maria-Florina F Balcan, and Le Song. Scalable Jean B. Lasserre. Bounds on measures satisfying mo- Kernel Methods via Doubly Stochastic Gradients. ment conditions. The Annals of Applied Probability, In Z. Ghahramani, M. Welling, C. Cortes, N. D. 12(3):1114–1137, 2002. Lawrence, and K. Q. Weinberger, editors, Advances Jiajin Li, Sen Huang, and Anthony Man-Cho So. in Neural Information Processing Systems 27, pages A First-Order Algorithmic Framework for Wasser- 3041–3049. Curran Associates, Inc., 2014. stein Distributionally Robust Logistic Regression. Erick Delage and Yinyu Ye. Distributionally Robust arXiv:1910.12778 [cs, math, stat], October 2019. Optimization Under Moment Uncertainty with Ap- Aleksander Madry, Aleksandar Makelov, Ludwig plication to Data-Driven Problems. Operations Re- Schmidt, Dimitris Tsipras, and Adrian Vladu. To- search, 58(3):595–612, June 2010. ISSN 0030-364X, wards Deep Learning Models Resistant to Adversar- 1526-5463. doi: 10.1287/opre.1090.0741. ial Attacks. arXiv:1706.06083 [cs, stat], September John Duchi, Peter Glynn, and Hongseok Namkoong. 2019. Statistics of robust optimization: A generalized Zico Kolter and Aleksander Madry. Adversarial Ro- empirical likelihood approach. arXiv preprint bustness - Theory and Practice. http://adversarial- arXiv:1610.03425, 2016. ml-tutorial.org/. John Duchi, Tatsunori Hashimoto, and Hongseok Namkoong. Distributionally Robust Losses for La- Peyman Mohajerin Esfahani and Daniel Kuhn. Data- tent Covariate Mixtures. arXiv:2007.13982 [cs, stat], driven distributionally robust optimization using the July 2020. Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Program- Laurent El Ghaoui and Hervé Lebret. Robust So- ming, 171(1):115–166, September 2018. ISSN 1436- lutions to Least-Squares Problems with Uncertain 4646. doi: 10.1007/s10107-017-1172-1. Data. SIAM Journal on Matrix Analysis and Ap- plications, 18(4):1035–1064, October 1997. ISSN Hongseok Namkoong and John C Duchi. Stochas- 0895-4798. doi: 10.1137/S0895479896298130. tic Gradient Methods for Distributionally Robust Optimization with f-divergences. In D. D. Lee, Rui Gao and Anton J. Kleywegt. Distributionally M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Gar- Robust Stochastic Optimization with Wasserstein nett, editors, Advances in Neural Information Pro- Distance. arXiv:1604.02199 [math], July 2016. cessing Systems 29, pages 2208–2216. Curran Asso- Arthur Gretton, Karsten M. Borgwardt, Malte J. ciates, Inc., 2016. Rasch, Bernhard Schölkopf, and Alexander Smola. A Arnab Nilim and Laurent El Ghaoui. Robust Con- Journal of Machine Learning kernel two-sample test. trol of Markov Decision Processes with Uncertain Research , 13(Mar):723–773, 2012. Transition Matrices. Operations Research, 53(5):780– F. Guerra Vázquez, J. J. Rückmann, O. Stein, and 798, October 2005. ISSN 0030-364X, 1526-5463. doi: G. Still. Generalized semi-infinite programming: 10.1287/opre.1050.0216. Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, Bernhard Schölkopf

Imre Pólik and Tamás Terlaky. A Survey of the S- Alex Smola, Arthur Gretton, Le Song, and Bernhard Lemma. SIAM Review, 49(3):371–418, January Schölkopf. A hilbert space embedding for distribu- 2007. ISSN 0036-1445, 1095-7200. doi: 10.1137/ tions. In International Conference on Algorithmic S003614450444614X. Learning Theory, pages 13–31. Springer, 2007. Ioana Popescu. A Semidefinite Programming Approach Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert to Optimal-Moment Bounds for Convex Classes of R. G. Lanckriet. Universality, Characteristic Kernels Distributions. Mathematics of Operations Research, and RKHS Embedding of Measures. Journal of 30(3):632–657, August 2005. ISSN 0364-765X, 1526- Machine Learning Research, 12(Jul):2389–2410, 2011. 5471. doi: 10.1287/moor.1040.0137. ISSN ISSN 1533-7928. Ali Rahimi and Benjamin Recht. Random Features Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur for Large-Scale Kernel Machines. In J. C. Platt, Gretton, Bernhard Schölkopf, and Gert R. G. Lanck- D. Koller, Y. Singer, and S. T. Roweis, editors, Ad- riet. On the empirical estimation of integral prob- vances in Neural Information Processing Systems 20, ability metrics. Electronic Journal of Statistics, 6: pages 1177–1184. Curran Associates, Inc., 2008. 1550–1599, 2012. ISSN 1935-7524. doi: 10.1214/ R. Tyrrell Rockafellar. Convex Analysis. Number 28. 12-EJS722. Princeton university press, 1970. Matthew Staib and Stefanie Jegelka. Distribution- ally robust optimization and generalization in kernel R. Tyrrell Rockafellar and Stanislav Uryasev. Opti- methods. In Advances in Neural Information Pro- mization of conditional value-at-risk. Journal of risk, cessing Systems, pages 9131–9141, 2019. 2:21–42, 2000. Ingo Steinwart and Andreas Christmann. Support Vec- W W Rogosinski. Moments of Non-Negative Mass. tor Machines. Springer Science & Business Media, page 28. 2008. Nikitas Rontsis, Michael A. Osborne, and Paul J. Vladislav B. Tadić, Sean P. Meyn, and Roberto Tempo. Goulart. Distributionally Ambiguous Optimization Randomized Algorithms for Semi-Infinite Program- for Batch Bayesian Optimization. Journal of Ma- ming Problems. In Giuseppe Calafiore and Fab- chine Learning Research, 21(149):1–26, 2020. ISSN rizio Dabbene, editors, Probabilistic and Randomized 1533-7928. Methods for Design under Uncertainty, pages 243– Herbert Scarf. A min-max solution of an inventory 261. Springer, London, 2006. ISBN 978-1-84628-095- problem. Studies in the mathematical theory of in- 5. doi: 10.1007/1-84628-095-8_9. ventory and production, 1958. Bart P. G. Van Parys, Paul J. Goulart, and Daniel B. Schölkopf, R. Herbrich, and A. J. Smola. A gen- Kuhn. Generalized Gauss inequalities via semidefi- eralized representer theorem. In D. Helmbold and nite programming. Mathematical Programming, 156 R. Williamson, editors, Annual Conference on Com- (1-2):271–302, March 2016. ISSN 0025-5610, 1436- putational Learning Theory, number 2111 in Lecture 4646. doi: 10.1007/s10107-015-0878-1. Notes in Computer Science, pages 416–426, Berlin, Lieven. Vandenberghe, Stephen. Boyd, and Kather- 2001. Springer. ine. Comanor. Generalized Chebyshev Bounds Bernhard Schölkopf, Alex J. Smola, Robert C. via Semidefinite Programming. SIAM Review, 49 Williamson, and Peter L. Bartlett. New support (1):52–64, January 2007. ISSN 0036-1445. doi: vector algorithms. Neural computation, 12(5):1207– 10.1137/S0036144504440543. 1245, 2000. Aladin Virmaux and Kevin Scaman. Lipschitz regular- Alexander Shapiro. On Duality Theory of Conic Lin- ity of deep neural networks: Analysis and efficient ear Problems. In Panos Pardalos, Miguel Á. Gob- estimation. In S. Bengio, H. Wallach, H. Larochelle, erna, and Marco A. López, editors, Semi-Infinite K. Grauman, N. Cesa-Bianchi, and R. Garnett, ed- Programming, volume 57, pages 135–165. Springer itors, Advances in Neural Information Processing US, Boston, MA, 2001. ISBN 978-1-4419-5204-2 978- Systems 31, pages 3835–3844. Curran Associates, 1-4757-3403-4. doi: 10.1007/978-1-4757-3403-4_7. Inc., 2018. Alexander Shapiro, Darinka Dentcheva, and Andrzej Zizhuo Wang, Peter W. Glynn, and Yinyu Ye. Like- Ruszczyński. Lectures on Stochastic Programming: lihood robust optimization for data-driven prob- Modeling and Theory. SIAM, 2014. lems. Computational Management Science, 13(2): Aman Sinha, Hongseok Namkoong, Riccardo Volpi, 241–261, April 2016. ISSN 1619-697X, 1619-6988. and John Duchi. Certifying Some Distributional doi: 10.1007/s10287-015-0240-3. Robustness with Principled Adversarial Training. Bo Wei, William B. Haskell, and Sixiang Zhao. The arXiv:1710.10571 [cs, stat], May 2020. CoMirror algorithm with random constraint sam- Kernel Distributionally Robust Optimization

pling for convex semi-infinite programming. Annals of Operations Research, September 2020. ISSN 1572- 9338. doi: 10.1007/s10479-020-03766-7. Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5286–5295. PMLR, 2018. Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support vector ma- chines. Journal of machine learning research, 10(7), 2009. Yangyang Xu. Primal-Dual Stochastic Gradient Method for Convex Programs with Many Functional Constraints. SIAM Journal on Optimization, 30(2): 1664–1692, January 2020. ISSN 1052-6234, 1095- 7189. doi: 10.1137/18M1229869. Chaoyue Zhao and Yongpei Guan. Data-driven risk- averse stochastic optimization with Wasserstein met- ric. Operations Research Letters, 46(2):262–267, March 2018. ISSN 01676377. doi: 10.1016/j.orl. 2018.01.011. Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, and Bernhard Schölkopf. Worst-Case Risk Quantification under Distributional Ambiguity using Kernel Mean Embedding in Moment Problem. arXiv:2004.00166 [cs, eess, math], March 2020. Steve Zymler, Daniel Kuhn, and Berç Rustem. Distribu- tionally robust joint chance constraints with second- order moment information. Mathematical Program- ming, 137(1-2):167–198, February 2013. ISSN 0025- 5610, 1436-4646. doi: 10.1007/s10107-011-0494-7. Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, Bernhard Schölkopf Appendix: Kernel Distributionally Robust Optimization

A PROOFS OF THEORETICAL RESULTS

Table 2 provides an overview to help readers navigate the theoretical results in this paper.

Table 2: List of the theoretical results in this paper

Theorems Generalized Duality Theorem 3.1 Strong duality of the inner moment problem Proposition A.1 Interpolation property Proposition 3.1.3 Complementarity condition Lemma A.2 Robust representer theorem Proposition B.1 IPM-DRO duality Corollary 3.1.1 Kernel DRO as stochastic optimization with expectation constraint Corollary 3.1.2 Formulations Kernel DRO primal (P) (2), dual (D) (4) IPM-DRO primal (3), dual (5) Formulations for various RKHS ambiguity sets Table 1, 3 Stochastic program with expectation constraint formulation of Kernel DRO (6),(9) Program to compute worst-case distributions (23),(24),(25) Kernel DRO convex program by the discretization of SIP (7) Kernel conditional value-at-risk (8)

In general, we refer to standard texts in optimization (Boyd et al., 2004; Shapiro et al., 2014; Ben-Tal et al., 2009), convex analysis (Rockafellar, 1970; Barvinok, 2002), and functional analysis (Conway, 2019) for more mathematical background.

Notation. In the proofs, we use M to denote the space of signed measures on X . The dual cone of a set of ∗ R signed measures K ⊆ M is defined as K := {h: h dm ≥ 0, ∀m ∈ K, h measurable}. Using the reproducing R property, we have the identity f dP = hf, µP iH for f ∈ H and P ∈ P, which we will frequently use in the proofs.

A.1 Proof of the Generalized Duality Theorem 3.1

We now derive our key result for Kernel DRO — the Generalized Duality Theorem, in Theorem 3.1. Let us first consider the inner moment problem of (2) Z Z sup l dP subject to φ dP = µ, (11) P ∈P,µ∈C

where we suppress θ in l(θ, ·) as we fix it for the moment. (11) generalizes the problem of moments in the sense that the constraint can be viewed as infinite-order moment constraints. Using conic duality, we obtain the strong duality of the inner moment problem. Proposition A.1 (Strong dual to (11)). Under Assumption 3.1, (11) is equivalent to solving

∗ min δC(f) + f0 f0∈R,f∈H (12) subject to l(ξ) ≤ f0 + f(ξ), ∀ξ ∈ X

∗ where δC is the support function of C, i.e., strong duality holds.

Using Proposition A.1, we can reformulate the inner moment problem in (2) to obtain Theorem 3.1. We now prove this generalized duality result for the inner moment problem in Proposition A.1. We first derive the weak dual and then prove the strong duality. Kernel Distributionally Robust Optimization

Proof. We first relax the constraint P ∈ P to its conic hull P ∈ co(P). To constrain P to still be a probability R measure, we impose 1 dP (x) = 1, which results in the primal problem equivalent to (11) Z Z Z (P ) := max l dP subject to φ dP = µ, 1 dP = 1. P ∈co(K),µ∈C

We construct the Lagrangian relaxation by associating the constraints with the dual variables f ∈ H, f0 ∈ R, as R well as adding the indicator function of C. Note both sides of the constraint φ dP = µ are functions in H, hence the multiplier f is an RKHS function. Z Z Z L(P, µ; f, f0) = l dP − δC(µ) + hµ − φ dP , fiH + f0(1 − 1 dP ) Z Z Z = l dP − δC(µ) + hµ, fiH − f dP + f0 − f0 dP Z = l − f − f0 dP + (hµ, fiH − δC(µ)) + f0. (13)

The second equality is due to the reproducing property of RKHS. The dual function is given by Z g(f, f0) = sup l − f − f0 dP + (hµ, fiH − δC(µ)) + f0. P,µ

∗ The first term is bounded above by 0 iff l − f − f0 ∈ −K . By Lemma D.1, this conic constraint is equivalent to the constraint of (12), l(ξ) ≤ f0 + f(ξ), ∀ξ ∈ X . ∗ Finally, expressing the second term using convex conjugate δC(f) = supµhµ, fiH − δC(µ) concludes the derivation.

Strong duality can potentially be adapted from the strong duality result of moment problem, e.g., (Shapiro, 2001). However, we give a self-contained proof with only elementary mathematics that sheds light on the connection between the RKHS theory and distributionally robust optimization. The proof is a generalization of the Euclidean space conic duality theorem ((Ben-Tal et al., 2009) Theorem A.2.1) to infinite dimensions. Figure 4 illustrates the idea of the proof.

Figure 4: Illustration of the strong duality proof that uses a separating hyperplane. See the proof for detailed descriptions.

Proof. We assume the dual optimal value of (12) is finite (D) < ∞. Since the converse means that the dual problem is infeasible, which implies that the primal problem is unbounded. Due to the upper semicontinuity of l in Assumption 3.1, this can not happen on a compact X .

Let us consider the Hilbert space R × H × R equipped with the inner product h, iR + h,iH + h, iR. We construct a cone in × H × R R  Z Z Z   A = 1 dP , φdP, l dP : P ∈ co(P) , where co again denotes conic hull. Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, Bernhard Schölkopf

Let t = (P ) denote the optimal primal value. ∀ > 0, we construct the set   B = (1, µ, t + ): µ ∈ C ,

which is a closed convex set with non-empty relative interior by Assumption 3.1 (i.e., Slater condition is satisfied).

It is straightforward to verify that those two sets do not intersect. Suppose x = (x1, x2, x3) ∈ A ∩ B, this means 0 0 R 0 0 R 0 0 0 ∃µ ,P such that x1 = 1 = 1 dP , x2 = µ = φdP , i.e., µ ,P is a primal feasible solution. Then the third R 0 coordinate of x satisfies x3 = l dP ≤ (P ) < t +  = x3, which is impossible. Hence, A ∩ B = ∅. In the rest of the proof, we will show that, ∀ > 0, the dual optimal value (D) satisfies (D) ≤ (P ) + .

Combining this with weak duality (D) ≥ (P ) will result in strong duality. We now justify this inequality. By the separation theorem, (see, e.g., (Barvinok, 2002) Theorem III.3.2, 3.4), there exists a closed hyperplane R that strictly separates A and B. The separation is strict because t +  > t = l dP . By the Riesz representation theorem, ∃(f0, f, τ) ∈ R × H × R, s ∈ R, such that

f0 + hf, µiH + τ(t + ) < s, Z Z Z f0 1 dP + hf, φdP iH + τ l dP > s.

Plugging in P = 0, we obtain s < 0. Since P lives in a cone, the left-hand side of the second inequality must be non-negative. Otherwise, we can scale P so that the separation will fail. In summary, we have

f0 + hf, µiH + τ(t + ) < 0, ∀µ ∈ C, Z Z Z (14) f0 1 dP + hf, φdP iH + τ l dP ≥ 0, ∀P ∈ co(P).

By Assumption 3.1 (Slater condition), the primal problem has a non-empty solution set. Because l is proper and upper semi-continuous and the feasible solution set for the optimization problem is compact (see Section D) , the primal optimum is attained by the extreme value theorem. Suppose P ∗ is a primal optimal solution, from the second inequality of (14), Z Z Z ∗ ∗ ∗ f0 1 dP + hf, φdP iH + τ l dP = f0 + hf, µP ∗ iH + τt ≥ 0.

Using this and the first inequality of (14), we obtain τ < 0. Without loss of generality, we let τ = −1. From the second inequality of (14), we have Z Z Z Z f0 1 dP + hf, φdP iH − l dP = f0 + f − l dP ≥ 0, ∀P ∈ co(P).

This tells us that f0, f is a feasible dual solution because it satisfies the semi-infinite constraint in (12). By the first inequality of (14), f0 + suphf, µiH ≤ t + , ∀ > 0, µ∈C where the left-hand side is precisely the dual objective in (12). This implies (D) ≤ (P ) + . By weak duality, (D) ≥ (P ). Therefore, strong duality holds.

This proof gives us the third interpretation of the dual variables f0, f — they define a separating hyperplane of A and B. Remark. From the proof, we see that the Slater condition in Assumption 3.1 is stronger than needed be. If C is singleton, we can still find a convex neighborhood W of the singleton B since R × H × R is locally convex. Then W and A can still be strictly separated using the same technique in the proof. Hence strong duality still holds when C is a singleton. Kernel Distributionally Robust Optimization

Table 3: Robust counterpart formulations of Kernel DRO.

RKHS ambiguity set C Robust counterpart formulation 1 PN norm-ball C = {µ: kµ − µPˆ kH ≤ } f0 + N i=1 f(ξi) + kfkH C = conv{C ,..., C } f + max δ∗ (f) convex hull 1 N 0 i Ci (same under closure clconv{C} )

example: polytope f0 + maxi f(ξi) C = conv{φ(ξ1), . . . , φ(ξN )} (equivalent to SVMs/scenario opt. (Calafiore and Campi, 2006)) PN C f + PN δ∗ (f) Minkowski sum i=1 i 0 i=1 Ci

example: C = C1 + C2 f0 + maxi f(ξi) + kfkH C1 = {µ: kµkH ≤ } C2 = conv{φ(ξ1), . . . , φ(ξN )} f + PN α δ∗ (f) affine combination 0 i=1 i Ci PN PN C = i=1 αiCi, i=1 αi = 1 f + α PN f(ξ ) + (1 − α)δ∗ (f) example: data contamination 0 N i=1 i CQ C = {αµPˆ + (1 − α)µQ : µQ ∈ CQ} C = ∩N C f + PN δ∗ (f ), PN f = f Intersection i=1 i 0 i=1 Ci i i=1 i C ⊆ H f + PN δ∗ (f ) f ∈ H multiple kernels i i 0 i=1 Ci i where i 1 PN PN PN example: Ci = {µ: kµ − µPˆ kH ≤ i} f0 + N i=1 i=j fi(ξj) +  i=1 kfikHi nPN 1 o 1 PN singleton C = i=1 N φ(ξi) f0 + N i=1 f(ξi) (equivalent to ERM/SAA)

entire RKHS C = H f0+ δ0(f) (equivalent to worst-case RO (Ben-Tal et al., 2009))

Finally, we summarize the results above to prove the Kernel DRO Generalized Duality Theorem 3.1

Proof. Theorem (3.1) is obtained by reformulating the inner moment problem in (2) using the strong duality result in Proposition A.1, i.e.,

 Z Z  min sup l(θ, ξ) dP (ξ): φ dP = µ, P ∈ P, µ ∈ C θ P,µ   ∗ = min min f0 + δC(f): l(θ, ξ) ≤ f0 + f(ξ), ∀ξ ∈ X , (15) θ f0∈R,f∈H

which results in formulation (4).

A.2 Table 3 deriving formulations for various choices of RKHS ambiguity set C

We now derive the formulations of support functions for various RKHS ambiguity sets in Table 3.

ˆ PN 1 (RKHS norm-ball) Let us consider the ambiguity set of C = {µ: kµ − µˆkH ≤ }, where P = i=1 N δξi . The support function is given by

∗ δC(f) = suphf, µiH = hf, µˆiH + sup hf, µ − µˆiH = hf, µˆiH + kfkH µ∈C kµ−µˆkH≤

where the last equality is by the Cauchy-Schwarz inequality, or alternatively by the self-duality of Hilbert norms. (Note we assume there exists some µ ∈ H such that kµ − µˆkH = .) Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, Bernhard Schölkopf

(Polytope, convex hull of ambiguity set) The result for convex hull follows from standard support function ∗ calculus. If the ambiguity set C is described by the polytope conv{φ(ξ1), . . . , φ(ξN )}, then δC(f) = max1≤i≤N f(ξi). Furthermore, the support function value remains the same under closure operation 1 . The equivalence to the scenario approach in (Calafiore and Campi, 2006) can be seen by noticing that max1≤i≤N l(ξi) ≤ f0 + max1≤i≤N f(ξi). If H is universal, then there exists f0, f such that the equality is attained.

(Minkowski sum, affine combination, intersection) Those cases follow directly from the support function calculus; cf. (Ben-Tal et al., 2015).

(Kernel DRO with multiple kernels) Let us consider multiple ambiguity sets from different RKHSs.

Suppose H1,..., HNh are RKHSs associated with feature maps φ1, . . . , φNh . Let C1,..., CNh be the ambiguity sets in the respective RKHSs. Kernel DRO formulation with multiple kernels is given By Z Z  min sup l(θ, ξ) dP (ξ): φi dP = µi,P ∈ P, µi ∈ Ci, i = 1 ...Nh , (16) θ P,µ

Using the same proof as Proposition A.1, we have the Kernel DRO reformulation

N X min f + δ∗ (f ) 0 Ci i θ,f0∈R,fi∈Hi i=1 . N (17) X subject to l(θ, ξ) ≤ f0 + fi(ξ), ∀ξ ∈ X i=1

Hence we obtain the formulation in Table 3.

nPN 1 o (Singleton ambiguity set C = i=1 N φ(ξi) ) By the reproducing property, the support function of the ∗ 1 PN singleton ambiguity set is given by δC(f) = N i=1 f(ξi).

∗ (If C = H, reduction to classical RO) δH(f) 6= ∞ iff f = 0. Then (4) is reduced to

min f0 θ,f0∈R (18) subject to l(θ, ξ) ≤ f0, ∀ξ ∈ X

which is the epigraphic form of the worst-case RO. 2

A.3 Complementarity condition and proof

∗ ∗ ∗ Lemma A.2 (Complementarity condition). Let P , f , f0 be a set of optimal primal-dual solutions of (P) and (D), then Z Z ∗ ∗ ∗ ∗ ∗ ∗ ∗ l − f − f0 dP = 0, δC(f ) = f dP (19)

If C = {µ: kµ − µPˆ kH ≤ }, the second equality implies Z ∗ f ∗ ˆ ∗ ˆ ∗ d(P − P ) = MMD(P , P ), (20) kf kH

which gives a second interpretation of the dual solution f ∗ as a witness function. It is well known that complementarity condition holds iff strong duality holds in the moment problem; cf. (Shapiro, 2001). The following is a straightforward proof. 1The convex hull can be replaced with its closure clconv(·). Note convex hulls in infinite-dimensional spaces are not automatically closed; cf. Krein-Milman theorem. 2Note that C = H is no longer closed. However, the resulting ambiguity set becomes P, which is still compact if X is compact. Kernel Distributionally Robust Optimization

∗ ∗ ∗ Proof. Plug P , f , f0 into Lagrangian (13), Z Z ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ l dP ≤ l − f − f0 dP + δC(f ) + f0 ≤ δC(f ) + f0 .

∗ ∗ By strong duality, all inequalities above are equalities. Therefore, the first equality gives the condition δC(f ) = R ∗ ∗ R ∗ ∗ ∗ f dP while the second yields l − f − f0 dP = 0.

A.4 Proof of Proposition 3.1.3 (Interpolation property)

∗ ∗ Proof. Since f0 , f is a solution to the inner moment problem of Kernel DRO (12), we have l(θ, ξ) ≤ ∗ ∗ f0 + f (ξ), ∀ξ ∈ X for any given θ. By the first equation in the complementarity condition (19), we have R ∗ ∗ ∗ ∗ l − f − f0 dP = 0. Hence the integrand must be zero P -a.e.

A.5 Corollary 3.1.1 IPM-DRO duality

We provide a derivation using a technique alternative to the proof of Proposition A.1.

Proof. We consider the Lagrangian Z L(P ; λ) = l dP − λ(dF (P, Pˆ) − ) Z Z = l dP − λ sup fd(P − Pˆ) + λ f∈F Z Z = inf l − λf dP + λ fdPˆ + λ f∈F N λ X ≤ inf sup[l(ξ) − λf(ξ)] + f(ξi) + λ. (21) f∈F N ξ∈X i=1 The second equality above is due to the dual representation of IPM. The last inequality is due to that the expectation is always dominated by the supremum. This results in the reformulation

N λ X min sup[l(ξ) − λf(ξ)] + f(ξi) + λ. θ,λ≥0,f∈F N ξ∈X i=1

By introducing the epigraphic variable f0, we obtain the reformulation (5).

A.6 Corollary 3.1.2 Kernel DRO as stochastic optimization with expectation constraint

Using the known relationship between semi-infinite constraint and expectation constraint (see, e.g., (Tadić et al., 2006, Theorem 1)), the SI constraint in (4) is equivalent to the expectation constraint in (6).

B COMPUTATIONAL FORMULATIONS

We now provide practical plug-in formulations for computation. Specifically, we can parametrize the RKHS function f by, e.g., the following methods. We note that the random feature method is well-suited for large scale problems, such as in SFG-DRO applications.

B.1 Random features

Common ways to parametrize an RKHS function include the representer theorem as well as approximations such as the random Fourier features (Rahimi and Recht, 2008). Recall that an RKHS function can be approximated by the finite feature expansion

N ˆ > ˆ 0 X ˆ ˆ 0 f(ξ) ≈ f(ξ) = w φ(ξ), k(x, x ) ≈ φi(x)φi(x ) i=1 Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, Bernhard Schölkopf

ˆ N ˆ 2 where {φi(x)}i=1 are the random features, e.g., random Fourier features φi(x) = cos(wix + bi), wi ∼ N(0, σ ), bi ∼ 2 Uniform[0, 2π]. If x is a vector, then wi ∼ N (0, Iσ ), and wix is the dot product. See, e.g., (Rahimi and Recht, 2008), for more properties. One strength of the Generalized Duality Theorem 3.1 is that it does not require the knowledge of the RKHS that the loss l lives in, which is typically not available in non-kernelized models. This enables us to use approximate features for commonly used RKHSs, e.g., random Fourier feature. This is a strength of our Kernel DRO theory. Note program (7) is a convex optimization problem with the random feature parametrization.

B.2 Distributionally robust version of representer theorem

f(·) = PN β k(ξ , ·) + PM γ k(ζ , ·) kfk = √In program (7), we may parametrize the RKHS function by i=1 i i j=1 j j , H > > > α Kα, where α = (β1, . . . , βN , γ1, . . . , γM ) ,K = [k(ηi, ηj)], η = (ξ1, . . . , ξN , ζ1, . . . , ζM ) . We justify this parametrization by the following DRO version of the RKHS representer theorem (Schölkopf et al., 2001). The intuition of the following result is to restrict Kernel DRO to a smaller ambiguity set of distributions supported M on {ζi}i=1 (i.e., replace P ∈ P by P ∈ PM , an inner approximation depending on M). In this setting, the ambiguity set only contains only distributions supported on (a subset of) ζi. Then it suffices to parametrize f in PN PM (7) by f(·) = i=1 βik(ξi, ·) + j=1 γjk(ζj, ·). N Lemma B.1 (Robust representer). Given data {ξi}i=1 and the ambiguity set chosen to be a set of embeddings with PM PM the form j=1 αjφ(ζj), for some 0 ≤ αj ≤ 1, j=1 αj = 1, and within the RKHS norm-ball C = {µ: kµ−µPˆ kH ≤ PN PM }. Then, it suffices to consider the RKHS function of the form f(·) = i=1 βik(ξi, ·) + j=1 γjk(ζj, ·) for some βi, γj ∈ R, i = 1...N, j = 1...M.

Lemma B.1 states that the expansion points of the RKHS representer in (7) are exactly the support of the empirical and worst-case distributions. It extends the classical RKHS representer theorem (Schölkopf et al., 2001), which uses only the empirical samples as expansion points. The implication is that, to be distributionally robust, we should choose the representers as in Lemma B.1 instead of only using empirical samples. Below is a proof that is similar to the original representer theorem.

PN PM Proof. In (7) we consider f = fs + f⊥, where fs = i=1 βik(ξi, ·) + j=1 γjk(ζj, ·) belongs to a subspace of the H and f⊥ its complement. Plug in f = fs + f⊥ to (7) and note the orthogonality, we obtained

N 1 X min f0 + fs(ξi) + (kfskH + kf⊥kH) θ,fs,f⊥,f0 N i=1 (22)

subject to l(θ, ζi) ≤ fs(ζj) + f0, j = 1 ...M.

It suffices to choose f⊥ = 0 in this optimization problem. Hence the conclusion follows.

Remark. Note the existence of a worst case distribution in more general settings is not yet proven. The discussion here is restricted to the setting of (7).

C FURTHER NUMERICAL EXPERIMENT RESULTS

We carry out additional numerical experiments to study Kernel DRO.

C.1 Testing other variants of Kernel DRO

We empirically test the following proposed variants of Kernel DRO.

• Relaxed Kernel DRO formulation (Kernel DRO-relaxed) with constraint hold for only the empirical samples, i.e., l(θ, ξi) ≤ f0 + f(ξi), i = 1 ...N.

• Unconstrained Kernel DRO using Kernel CVaR in Example 4.1 Kernel Distributionally Robust Optimization

5 K-DRO 4 = 0.1 ERM 3 K-DRO-relaxed = 0.1 2

test loss K-DRO-CVaR = 0.1 1 = 0.01

0 0 1 2 3 4 perturbation

Figure 5: Comparing Kernel DRO-relaxed, Kernel DRO-KCVaR, ERM, and regular Kernel DRO. y-axis limit is adjusted to show the plot. All error bars are in standard error.

We compare Kernel DRO-relaxed with the ERM as well as the regular Kernel DRO. Compared with ERM, Kernel DRO-relaxed still possesses moderate robustness. In this case, we effectively proposed a way to apply RKHS regularization to general optimization problems, not limited to kernelized models. Hence, it may be used in practice as a finite-sample approximation to Kernel DRO. We then test the Kernel DRO using the unconstrained objective given by Kernel CVaR. We observe no significant difference in performance between Kernel DRO-KCVaR (with small chance constraint level α.) and regular Kernel DRO (7).

C.2 Analyzing the generalization behavior

An insight can be obtained by observing the plot of the MMD estimator between the training and test data in Figure 6 (left). As Kernel DRO with  = 0.5 robustified against perturbation less than the level MMD = 0.5, we see this threshold was exceeded as we increase the perturbation in test data. Meanwhile, this is the same time (∆ ≈ 1.5) R 1 PN where Kernel DRO solutions start to exceed the generalization bound l dP ≤ f0 + N i=1 f(ξi) + kfkH, see Figure 6 (right). This empirically validates our theoretical results for robustification.

3.0 0.6 2.5 0.5 2.0

0.4 1.5

test loss 1.0 0.3 K-DRO

MMD test vs. train 0.5 = 0.5 0.2 bound 0.0 0 2 4 0 1 2 3 4 perturbation perturbation

Figure 6: (Left) MMD estimator between the empirical samples and test samples. The level MMD = 0.5 is marked in red. (Right) Loss compared to the generalization bound. As the test data falls outside the robustification level 1 PN , the loss starts to exceed the generalization bound (red) f0 + N i=1 f(ξi) + kfkH.

C.3 Miscellaneous details for experimental set-up

Robust least squares example. Our experiments are implemented in Python. The convex optimization problems are solved using ECOS or MOSEK interfaced with CVXPY. In the experiments, we chose the bandwidth for the Gaussian kernel using the medium heuristic (Gretton et al., 2012).  in this paper are fixed to constants below 2 for Gaussian kernels. Choosing  can be further motivated by kernel statistical tests (Gretton et al., 2012) and is left for future work.

Sampled ζj In applying Kernel DRO using (7), we may obtain ζj by simply sampling in X . {ζj}i need not be real data, e.g., in stochastic control, they can be a grid of system states; in learning, they can be synthetic Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, Bernhard Schölkopf

0.4 1.0 0.2 P * 0.5 P 0.0 prob. mass. 1 0 0.0 1 0.0 0.2 0.4 0.6 0.8 1.0

(a) Support sampling (b) Perturbation

∗ Figure 7: Computing the worst-case distribution P using (a): (23) samples possible support ζj then optimizes w.r.t. weights α. (b): (24) moves the empirical samples directly. PN ˆ samples such as convex combinations of data ζj = i=1 aijξi, a·j ∈ SN (simplex), or perturbations ζj = ξi + ∆i where ∆i can be a small perturbation, or they can be obtained by domain knowledge of the specific application. In the setting of supervised machine learning, there is a difference between this paper’s approach of sampling ζj and commonly used data-augmentation techniques: ζj need not have the correct labels or targets. Directly training on them may have unforeseen consequences. For example, in the robust least squares experiment, we sampled the support ζi uniformly random from [−1, 1].

Robust learning under adversarial perturbation example. For the MNIST robust classification example, we used a neural network with two hidden layers with 64 units each. For the training of ERM and PGD, we used the ADAM optimization routine implemented in the PyTorch library. In Step 3 of SFG-DRO, we used random Fourier features (Rahimi and Recht, 2008) with 500 features. In Step 5 of SFG-DRO, we used the SA routine from CSA algorithm (Lan and Zhou, 2020). While other SA routines can be used, we prefer the simplicity of CSA in that it does not use a dual variable. We set the threshold and step-size of the CSA algorithm (Lan √1 and Zhou, 2020) to decay at the rate of k as suggested in that paper. We did not attempt further adaptive tuning of the step-sizes or the proposing distribution for ζ (we generate 3000 samples uniformly in Step 2 of SFG-DRO), which may further improve the performance. Parameter (weights of the neural nets) averaging is used for training all models. In the visualization of the predictions in Figure 3d, we perturbed the images by the PGD method (Madry et al., 2019; Madry) based on the ERM loss and linear model. SFG-DRO does not have the knowledge of the perturbation method.

C.4 Computing worst-case distributions

We have proposed Kernel DRO for making the decision θ via reformulation (4). In practice, it is often useful to find the worst-case distribution P ∗ (e.g., to study adversarial examples). We now propose two practical methods to compute P ∗ for a given θ, based on support sampling and perturbation, respectively. We illustrate the ideas in Figure 7. Support sampling. We consider the moment problem (11) where the distribution is restricted to discrete M 2 distributions supported on some sampled support {ζj}j=1 ⊆ X . For any given θ, M M N X X 1 X max αil(θ, ζj) subject to αiφ(ζj) − φ(ξi) ≤ . (23) α∈SM N j=1 j=1 i=1 H (23) can be written as a quadratically constrained program with linear objective, which admits a (strong) semidefinite program dual via what is historically known as the S-lemma (Pólik and Terlaky, 2007) (cf. appendix). Alternatively, (23) can be directly handled by convex solvers for a given θ. Note this approach was previously used in solving the problem of moments in (Zhu et al., 2020). Perturbation. Alternatively, we search for worst-case distributions that are perturbations of the empirical distribution. Let di ∈ X be some perturbation vector, given θ, N N 1 X 1 X max l(θ, ξi + di) subject to (φ(ξi + di) − φ(ξi)) ≤ . (24) di,i=1...N, N N i=1 i=1 H ξi+di∈X

2 M Note the sampled support {ζj }j=1 need not be real data; they are only the candidates for the worst-case support. The purpose is to make the the semi-infinite constraint approximately satisfied. See the appendix for more details. Kernel Distributionally Robust Optimization

Compared with (23), (24) directly searches for the support of the worst-case distribution. It can be interpreted as transporting the probability mass from empirical samples ξi to form the worst-case distribution. Depending on the kernel used, (24) may become a nonlinear program. However, its feasibility is guaranteed since it can always be initialized with a feasible solution di = 0. We now empirically examine the support sampling method (23) and perturbation method (24) to recover the worst-case distribution. Since both programs (23) and (24) search for the worst-case distribution within a subset of all distributions, their optimal values lower-bound the true worst-case risk (P) in (11), i.e., with finite samples, they are optimistic bound. Under the experimental setting as in Figure 3b, we ran Kernel DRO with fewer empirical samples (N = 5). After we obtain the Kernel DRO solution θ∗, we plug it into (23) and (24), respectively, to compute the worst-case distribution P ∗. Figure 7 plots the results. Note (23) is a convex optimization problem, while (24) results in a nonlinear program (with Gaussian kernel). Nonetheless, we solve it with an always-feasible initialization di = 0.

C.5 SDP dual via S-lemma

We consider a discretized version of the primal moment problem in (23) where the distribution is constrained to be a discrete distribution. We rewrite (23) as a quadratically constrained program using the plug-in estimator of MMD, M X max αil(ζi) α i=1 1 1 subject to α>K α − 2 α>K 1 + 1>K 1 ≤ 2 z N zx N 2 x M X αi = 1, αi ≥ 0, i = 1 ...M. i=1

This is a quadratically constrained linear objective convex optimization problem, where the Gram matrix Kz almost always has exponentially decaying eigenvalues. By applying S-lemma (Pólik and Terlaky, 2007), this program can be reformulated as the following SDP,

min t λ≥0,x,y≥0,t  1 1  (25) λP −λq − 2 (l + x · + y) subject to 1 1 > 2 ≥ 0, (−λq − 2 (l + x · + y)) t − λ + x + λr

1 > 1 > ˆ ˆ ˆ where P := Kz, q := N 1 Kzx, r := N 2 1 Kx1, and Kz = [k(ζi, ζj)]ij,Kzx = [k(ζi, ξj)]ij,Kx = [k(ξi, ξj)]ij, l = > [l(ζ1), . . . , l(ζM )] .

D SUPPORTING LEMMAS

We establish a few technical results that are used in the proofs.

D.1 Reducing conic constraint to infinite constraint

To derive the semi-infinite constraint in (12), we need a standard result from the literature of the moment problem. We give a self-contained proof below. ∗ ∗ Lemma D.1. Let K be the dual cone to the probability simplex P. The conic constraint l − f − f0 ∈ −K is equivalent to l(θ, ξ) ≤ f0 + f(ξ), ∀ξ ∈ X . (26)

Proof. “ =⇒ ”: Let us consider the set of all Dirac measures on X , D := {δξ : ξ ∈ X }. For any ξ ∈ X , we have Z l(ξ) − f0 − f(ξ) = l − f0 − fdδξ ≤ 0. Jia-Jie Zhu, Wittawat Jitkrittum, Moritz Diehl, Bernhard Schölkopf

Hence sufficiency. 0 R 0 “ ⇐= ”: Suppose there exists P ∈ co(P) such that l − f0 − fdP > 0. Without loss of generality, we assume P 0 ∈ P, or we can normalize it to be a probability measure. Then, Z 0 0 < l − f0 − fdP ≤ sup l(ξ) − f0 − f(ξ) ≤ 0. ξ∈X

The second inequality is due to that expectation is always less than or equal to the supremum. The last inequality ∗ holds because l is u.s.c. This double inequality is impossible, hence l − f0 − f ∈ −K .

Note an extension of this result to generating classes other than all Dirac measures D can be proved using Choquet theory, cf. (Shapiro et al., 2014, Proposition 6.66) (Popescu, 2005, Lemma 3.1), as well as in (Shapiro, 2001; Rogosinski).

D.2 Compactness of the ambiguity set

We now prove the compactness of the ambiguity set. We use the mean map notation T : P 7→ µP to denote a map between the space of P equipped with MMD, and H equipped with its norm. Let us denote the image of a subset K of measures under T by T (K) := {µP | P ∈ K} ⊆ H. If H is universal, then MMD is a metric. By the definition of MMD, T is an isometry (i.e., distance-preserving map) between P and H. Lemma D.2. T (P) is compact if X is compact.

Proof. If X is compact, by Prokhorov’s theorem P is compact. Since T is an isometry, T (P) is compact.

It is straightforward to verify that T (P) is convex.

Lemma D.3. Let CP = C ∩ T (P). If X is compact, under Assumption 3.1, Cp is compact.

Proof. By the Krein-Milman theorem, the convexity and compactness of T (P) (proved in the previous lemma) imply that T (P) is closed. By Assumption 3.1, C is closed, which results in the closedness of Cp. Since Cp is a closed subset of a compact set T (P), it is compact.

Recall that we denote the feasible set of probability measures, i.e., ambiguity set, for primal Kernel DRO (2) by R KC = {P : φ dP = µ, µ ∈ C,P ∈ P}. It is convex by straightforward verification. Let us derive the following compactness property of the ambiguity set.

Lemma D.4. If X is compact, under Assumption 3.1, KC is compact.

−1 Proof. We first note KC = T (Cp) and T is an isometric isomorphism (i.e., bijective isometry) between KC and Cp. Then KC is compact since Cp is compact.