arXiv:2006.06981v3 [math.OC] 27 Feb 2021
Kernel Distributionally Robust Optimization

Jia-Jie Zhu, Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany
Wittawat Jitkrittum, Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany (currently at Google Research, NYC, USA)
Moritz Diehl, Department of Microsystems Engineering & Department of Mathematics, University of Freiburg, Freiburg, Germany
Bernhard Schölkopf, Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany

Abstract

We propose kernel distributionally robust optimization (Kernel DRO) using insights from the robust optimization theory and functional analysis. Our method uses reproducing kernel Hilbert spaces (RKHS) to construct a wide range of convex ambiguity sets, which can be generalized to sets based on integral probability metrics and finite-order moment bounds. This perspective unifies multiple existing robust and stochastic optimization methods. We prove a theorem that generalizes the classical duality in the mathematical problem of moments. Enabled by this theorem, we reformulate the maximization with respect to measures in DRO into the dual program that searches for RKHS functions. Using universal RKHSs, the theorem applies to a broad class of loss functions, lifting common limitations such as polynomial losses and knowledge of the Lipschitz constant. We then establish a connection between DRO and stochastic optimization with expectation constraints. Finally, we propose practical algorithms based on both batch convex solvers and stochastic functional gradient, which apply to general optimization and machine learning tasks.

This work appeared in the proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130.

1 INTRODUCTION

Imagine a hypothetical scenario in the illustrative figure where we want to arrive at a destination while avoiding unknown obstacles. A worst-case robust optimization (RO) (Ben-Tal et al., 2009) approach is then to avoid the entire unsafe area (left, blue). Suppose we have historical locations of the obstacles (right, dots). We may choose to avoid only the convex polytope that contains all the samples (pink). This data-driven robust decision-making idea improves efficiency while retaining robustness.

The concept of distributional ambiguity concerns the uncertainty of uncertainty: the underlying probability measure is only partially known or subject to change. This idea is by no means a new one. The classical moment problem concerns itself with estimating the worst-case risk expressed by $\max_{P \in \mathcal{K}} \int l \, dP$, where $l$ is some loss function. The constraint $P \in \mathcal{K}$ describes the distribution ambiguity, i.e., $P$ is only known to live within a subset $\mathcal{K}$ of probability measures. The solution to the moment problem gives the risk under some worst-case distribution within $\mathcal{K}$. To make decisions that will minimize this worst-case risk is the idea of distributionally robust optimization (DRO) (Delage and Ye, 2010; Scarf, 1958).

Many of today's learning tasks suffer from various manifestations of distributional ambiguity (e.g., covariate shift, adversarial attacks, simulation-to-reality transfer), phenomena that are caused by the discrepancy between training and test distributions. Kernel methods are known to possess robustness properties, e.g., (Christmann and Steinwart, 2007; Xu et al., 2009). However, this robustness only applies to kernelized models. This paper extends the robustness of kernel methods using the robust counterpart formulation techniques (Ben-Tal et al., 2009) as well as the principled conic duality theory (Shapiro, 2001). We term our approach kernel distributionally robust optimization (Kernel DRO), which can robustify general optimization solutions not limited to kernelized models.

The main contributions of this paper are:

1. We rigorously prove the generalized duality theorem (Theorem 3.1) that reformulates general DRO into a convex dual problem searching for RKHS functions, lifting common limitations of DRO on the loss functions, such as the knowledge of the Lipschitz constant. The theorem also constitutes a generalization of the duality results from the literature on the mathematical problem of moments.
2. We use RKHSs to construct a wide range of convex ambiguity sets (Tables 1 and 3), including sets based on integral probability metrics (IPM) and finite-order moment bounds. This perspective unifies existing RO and DRO methods.
3. We propose computational algorithms based on both convex solvers and stochastic approximation, which can be applied to robustify general optimization and machine learning models not limited to kernelized or known-Lipschitz-constant ones.
4. Finally, we establish an explicit connection between DRO and stochastic optimization with expectation constraints. This leads to a novel stochastic functional gradient DRO (SFG-DRO) algorithm which can scale up to modern machine learning tasks.

In addition, we give complete self-contained proofs in the appendix that shed light on the connection between RKHSs, conic duality, and DRO. We also show that universal RKHSs are large enough for DRO from the perspective of functional analysis through concrete examples.

2 BACKGROUND

Notation. $\mathcal{X} \subset \mathbb{R}^d$ denotes the input domain, which is assumed to be compact unless otherwise specified. $\mathcal{P} := \mathcal{P}(\mathcal{X})$ denotes the set of all Borel probability measures on $\mathcal{X}$. We use $\hat{P}$ to denote the empirical distribution $\hat{P} = \sum_{i=1}^{N} \frac{1}{N} \delta_{\xi_i}$, where $\delta$ is a Dirac measure and $\{\xi_i\}_{i=1}^{N}$ are data samples. We refer to the function $\delta_C(x) := 0$ if $x \in C$, $\infty$ if $x \notin C$, as the indicator function. $\delta^*_C(f) := \sup_{\mu \in C} \langle f, \mu \rangle_{\mathcal{H}}$ is the support function of $C$. $S_N$ denotes the $N$-dimensional simplex. $\mathrm{ri}(\cdot)$ denotes the relative interior of a set. A function $f$ is upper semicontinuous on $\mathcal{X}$ if $\limsup_{x \to x_0} f(x) \le f(x_0)$, $\forall x_0 \in \mathcal{X}$; it is proper if it is not identically $-\infty$. When there is no ambiguity, we simplify the loss function notation $l(\theta, \cdot)$ by using $l$ to indicate that results hold for $\theta$ point-wise.

2.1 Robust and distributionally robust optimization

Robust optimization (RO) (Ben-Tal et al., 2009) studies mathematical decision-making under uncertainty. It solves the min-max problem (omitting constraints) $\min_{\theta} \sup_{\xi \in \mathcal{X}} l(\theta, \xi)$, where $l(\theta, \xi)$ denotes a general loss function, $\theta$ is the decision variable, and $\xi$ is a variable representing the uncertainty. Intuitively, RO makes the decision assuming an adversarial scenario, as reflected in taking the supremum w.r.t. $\xi$. For this reason, it is often referred to as the worst-case RO. Recently, RO has been applied to the setting of adversarially robust learning, e.g., in (Madry et al., 2019; Wong and Kolter, 2018), which we will visit in this paper. In the optimization literature, a typical approach to solving RO is via reformulating the min-max program using duality to obtain a single minimization problem. In contrast, distributionally robust optimization (DRO) minimizes the expected loss assuming the worst-case distribution:

$$\min_{\theta} \sup_{P \in \mathcal{K}} \int l(\theta, \xi) \, dP(\xi), \tag{1}$$

where $\mathcal{K} \subseteq \mathcal{P}$, called the ambiguity set, is a subset of distributions, e.g., all distributions with the given mean and variance. Compared with RO, DRO only robustifies the solution against a subset $\mathcal{K}$ of distributions on $\mathcal{X}$ and is, therefore, less conservative (since $\sup_{P \in \mathcal{K}} \{ \int l(\theta, \xi) \, dP(\xi) \} \le \sup_{\xi \in \mathcal{X}} l(\theta, \xi)$).

The inner problem of (1), historically known as the problem of moments traced back at least to Thomas Joannes Stieltjes, estimates the worst-case risk under uncertainty in distributions. The modern approaches, pioneered by the work of (Isii, 1962) (see also (Lasserre, 2002; Shapiro, 2001; Bertsimas and Popescu, 2005; Popescu, 2005; Vandenberghe et al., 2007; Van Parys et al., 2016; Zhu et al., 2020)), typically seek a sharp upper bound via duality. This duality, rigorously justified in (Shapiro, 2001), is different from that in the Euclidean space because infinite-dimensional convex sets can become pathological. Using that methodology, we can reformulate DRO (1) into a single solvable minimization problem.

Existing DRO approaches can be grouped into three main categories by the type of ambiguity sets used. DRO with (finite-order) moment constraints has been studied in (Delage and Ye, 2010; Scarf, 1958; Zymler et al., 2013). The authors of (Ben-Tal et al., 2013; Iyengar, 2005; Nilim and El Ghaoui, 2005; Wang et al., 2016; Duchi et al., 2016) studied DRO using likelihood bounds as well as $\phi$-divergence. Wasserstein-distance-based DRO has been studied by the authors of (Mohajerin Esfahani and Kuhn, 2018; Zhao and Guan, 2018; Gao and Kleywegt, 2016; Blanchet et al., 2019), and applied in a large body of literature. Many existing approaches require either assumptions such as quadratic loss functions, or knowledge of the Lipschitz constant or RKHS norm of the loss $l$, which are often hard to obtain in practice; see (Virmaux and Scaman, […]

[…] 2012). With a universal $\mathcal{H}$, given two distributions $P, Q$, $\|\mu_P - \mu_Q\|_{\mathcal{H}}$ defines a metric. This quantity is known as the maximum mean discrepancy (MMD) (Gretton et al., 2012). With $\|f\|_{\mathcal{H}} := \sqrt{\langle f, f \rangle_{\mathcal{H}}}$ and the reproducing property, it can be shown that $\|\mu_P - \mu_Q\|_{\mathcal{H}}^2 = \mathbb{E}_{x, x' \sim P} k(x, x') + \mathbb{E}_{y, y' \sim Q} k(y, y') - 2 \mathbb{E}_{x \sim P, y \sim Q} k(x, y)$, allowing the plug-in estimator to be used for estimating the MMD from empirical data. The MMD is an instance of the class of integral probability metrics (IPMs), and can equivalently be written as $\|\mu_P - \mu_Q\|_{\mathcal{H}} = \sup_{\|f\|_{\mathcal{H}} \le 1} \int f \, d(P - Q)$, where the optimum $f^*$ is a witness function (Gretton et al., 2012; Sriperumbudur et al., 2012).