Kernel Distributionally Robust Optimization: Generalized Duality Theorem and Stochastic Approximation
Jia-Jie Zhu
Empirical Inference Department
Max Planck Institute for Intelligent Systems
Tübingen, Germany
[email protected]

Wittawat Jitkrittum
Empirical Inference Department
Max Planck Institute for Intelligent Systems
Tübingen, Germany
Currently at Google Research, NYC, USA
[email protected]

Moritz Diehl
Department of Microsystems Engineering & Department of Mathematics
University of Freiburg
Freiburg, Germany
[email protected]

Bernhard Schölkopf
Empirical Inference Department
Max Planck Institute for Intelligent Systems
Tübingen, Germany
[email protected]

Abstract

We propose kernel distributionally robust optimization (Kernel DRO) using insights from the robust optimization theory and functional analysis. Our method uses reproducing kernel Hilbert spaces (RKHS) to construct a wide range of convex ambiguity sets, which can be generalized to sets based on integral probability metrics and finite-order moment bounds. This perspective unifies multiple existing robust and stochastic optimization methods. We prove a theorem that generalizes the classical duality in the mathematical problem of moments. Enabled by this theorem, we reformulate the maximization with respect to measures in DRO into the dual program that searches for RKHS functions. Using universal RKHSs, the theorem applies to a broad class of loss functions, lifting common limitations such as polynomial losses and knowledge of the Lipschitz constant. We then establish a connection between DRO and stochastic optimization with expectation constraints. Finally, we propose practical algorithms based on both batch convex solvers and stochastic functional gradient, which apply to general optimization and machine learning tasks.

1 INTRODUCTION

Imagine a hypothetical scenario in the illustrative figure where we want to arrive at a destination while avoiding unknown obstacles. A worst-case robust optimization (RO) (Ben-Tal et al., 2009) approach is then to avoid the entire unsafe area (left, blue). Suppose we have historical locations of the obstacles (right, dots). We may choose to avoid only the convex polytope that contains all the samples (pink). This data-driven robust decision-making idea improves efficiency while retaining robustness.

The concept of distributional ambiguity concerns the uncertainty of uncertainty: the underlying probability measure is only partially known or subject to change. This idea is by no means a new one. The classical moment problem concerns itself with estimating the worst-case risk expressed by max_{P∈K} ∫ l dP, where l is some loss function. The constraint P ∈ K describes the distribution ambiguity, i.e., P is only known to live within a subset K of probability measures. The solution to the moment problem gives the risk under some worst-case distribution within K. To make decisions that minimize this worst-case risk is the idea of distributionally robust optimization (DRO) (Delage and Ye, 2010; Scarf, 1958).
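To make the worst-case risk concrete, the following toy sketch computes max_{P∈K} E_P[l] when K is, purely for illustration, the set of reweightings of the observed samples whose weights stay within a small box around the uniform weights; the loss, the samples, and the radius eps are made-up choices, and this finite-support set is not one of the RKHS-based ambiguity sets developed in this paper.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
xi = rng.normal(size=20)            # toy samples of the uncertain variable xi
loss = (1.5 - xi) ** 2              # l(theta, xi) at a fixed decision theta = 1.5
N, eps = len(xi), 0.03              # eps controls the size of the toy ambiguity set

# Worst-case risk: maximize sum_i w_i * loss_i over
# K = {w : sum_i w_i = 1, w_i >= 0, |w_i - 1/N| <= eps}.
res = linprog(
    c=-loss,                                    # linprog minimizes, so negate the loss
    A_eq=np.ones((1, N)), b_eq=[1.0],
    bounds=[(max(0.0, 1 / N - eps), 1 / N + eps)] * N,
)
worst_case_risk = -res.fun

print("empirical risk     :", loss.mean())      # risk under the empirical distribution
print("worst-case DRO risk:", worst_case_risk)  # risk under the worst distribution in K
print("worst case over xi :", loss.max())       # RO-style bound over the sample support
```

As expected, the worst-case risk over this restricted set lies between the empirical risk and the pointwise worst case, which is the sense in which restricting K improves efficiency while retaining robustness.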
Many of today's learning tasks suffer from various manifestations of distributional ambiguity, e.g., covariate shift, adversarial attacks, and simulation-to-reality transfer, phenomena that are caused by the discrepancy between training and test distributions. Kernel methods are known to possess robustness properties, e.g., (Christmann and Steinwart, 2007; Xu et al., 2009). However, this robustness only applies to kernelized models. This paper extends the robustness of kernel methods using the robust counterpart formulation techniques (Ben-Tal et al., 2009) as well as the principled conic duality theory (Shapiro, 2001). We term our approach kernel distributionally robust optimization (Kernel DRO), which can robustify general optimization solutions not limited to kernelized models.

The main contributions of this paper are:

1. We rigorously prove the generalized duality theorem (Theorem 3.1) that reformulates general DRO into a convex dual problem searching for RKHS functions, lifting common limitations of DRO on the loss functions, such as the knowledge of the Lipschitz constant. The theorem also constitutes a generalization of the duality results from the literature on the mathematical problem of moments.
2. We use RKHSs to construct a wide range of convex ambiguity sets (in Tables 1 and 3), including sets based on integral probability metrics (IPMs) and finite-order moment bounds. This perspective unifies existing RO and DRO methods.
3. We propose computational algorithms based on both convex solvers and stochastic approximation, which can be applied to robustify general optimization and machine learning models not limited to kernelized or known-Lipschitz-constant ones.
4. Finally, we establish an explicit connection between DRO and stochastic optimization with expectation constraints. This leads to a novel stochastic functional gradient DRO (SFG-DRO) algorithm which can scale up to modern machine learning tasks.

In addition, we give complete self-contained proofs in the appendix that shed light on the connection between RKHSs, conic duality, and DRO. We also show that universal RKHSs are large enough for DRO from the perspective of functional analysis through concrete examples.

2 BACKGROUND

Notation. X ⊂ R^d denotes the input domain, which is assumed to be compact unless otherwise specified. P := P(X) denotes the set of all Borel probability measures on X. We use P̂ to denote the empirical distribution P̂ = (1/N) Σ_{i=1}^N δ_{ξ_i}, where δ is a Dirac measure and {ξ_i}_{i=1}^N are data samples. We refer to the function δ_C(x) := 0 if x ∈ C, +∞ if x ∉ C, as the indicator function. δ*_C(f) := sup_{µ∈C} ⟨f, µ⟩_H is the support function of C. S_N denotes the N-dimensional simplex. ri(·) denotes the relative interior of a set. A function f is upper semicontinuous on X if lim sup_{x→x₀} f(x) ≤ f(x₀) for all x₀ ∈ X; it is proper if it is not identically −∞. When there is no ambiguity, we simplify the loss function notation l(θ, ·) by writing l, to indicate that results hold for θ point-wise.

2.1 Robust and distributionally robust optimization

Robust optimization (RO) (Ben-Tal et al., 2009) studies mathematical decision-making under uncertainty. It solves the min-max problem (omitting constraints) min_θ sup_{ξ∈X} l(θ, ξ), where l(θ, ξ) denotes a general loss function, θ is the decision variable, and ξ is a variable representing the uncertainty. Intuitively, RO makes the decision assuming an adversarial scenario, as reflected in taking the supremum w.r.t. ξ. For this reason, it is often referred to as worst-case RO. Recently, RO has been applied to the setting of adversarially robust learning, e.g., in (Madry et al., 2019; Wong and Kolter, 2018), which we will visit in this paper. In the optimization literature, a typical approach to solving RO is to reformulate the min-max program using duality, obtaining a single minimization problem. In contrast, distributionally robust optimization (DRO) minimizes the expected loss assuming the worst-case distribution:

    min_θ sup_{P∈K} ∫ l(θ, ξ) dP(ξ),    (1)

where K ⊆ P, called the ambiguity set, is a subset of distributions, e.g., all distributions with a given mean and variance. Compared with RO, DRO only robustifies the solution against a subset K of distributions on X and is, therefore, less conservative, since sup_{P∈K} ∫ l(θ, ξ) dP(ξ) ≤ sup_{ξ∈X} l(θ, ξ).
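As a concrete and intentionally simple instance of (1), the sketch below robustifies a one-dimensional decision against an ε-contamination set K = {(1 − ε)P̂ + εQ : Q ∈ P(X)} on a compact interval X = [a, b], for which the inner supremum has the closed form (1 − ε) E_{P̂}[l(θ, ·)] + ε sup_{ξ∈X} l(θ, ξ). The contamination set, the quadratic loss, and all constants are illustrative assumptions, not the RKHS-based ambiguity sets proposed in this paper; the gradient step uses the maximizing ξ in the spirit of a Danskin-type argument.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = -1.0, 2.0                        # compact domain X = [a, b]
xi = rng.uniform(-1.0, 0.5, size=50)    # toy samples concentrated in part of the domain
eps, lr = 0.1, 0.05                     # contamination level and step size (toy values)

def worst_case_risk_and_grad(theta):
    """sup over K = {(1-eps) * P_hat + eps * Q} of E_P[(theta - xi)^2], and its gradient."""
    # The quadratic loss is convex in xi, so its maximum over [a, b] sits at an endpoint.
    xi_star = a if (theta - a) ** 2 >= (theta - b) ** 2 else b
    risk = (1 - eps) * np.mean((theta - xi) ** 2) + eps * (theta - xi_star) ** 2
    grad = (1 - eps) * np.mean(2 * (theta - xi)) + eps * 2 * (theta - xi_star)
    return risk, grad

theta = 0.0
for _ in range(200):                    # plain gradient descent on the worst-case risk
    _, grad = worst_case_risk_and_grad(theta)
    theta -= lr * grad

print("DRO decision        :", theta)
print("empirical minimizer :", xi.mean())   # minimizer of the non-robust expected loss
```

The robust decision is pulled away from the sample mean toward the side of the domain that carries the worst-case mass, illustrating how the choice of K trades off nominal performance against robustness.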
The inner problem of (1), historically known as the problem of moments and traced back at least to Thomas Joannes Stieltjes, estimates the worst-case risk under uncertainty in distributions. The modern approaches, pioneered by the work of (Isii, 1962) (see also (Lasserre, 2002; Shapiro, 2001; Bertsimas and Popescu, 2005; Popescu, 2005; Vandenberghe et al., 2007; Van Parys et al., 2016; Zhu et al., 2020)), typically seek a sharp upper bound via duality. This duality, rigorously justified in (Shapiro, 2001), is different from that in the Euclidean space because infinite-dimensional convex sets can become pathological. Using that methodology, we can reformulate DRO (1) into a single solvable minimization problem.

Existing DRO approaches can be grouped into three main categories by the type of ambiguity sets used. DRO with (finite-order) moment constraints has been studied in (Delage and Ye, 2010; Scarf, 1958; Zymler et al., 2013). The authors of (Ben-Tal et al., 2013; Iyengar, 2005; Nilim and El Ghaoui, 2005; Wang et al., 2016; Duchi et al., 2016) studied DRO using likelihood bounds as well as φ-divergences. Wasserstein-distance-based DRO has been studied by the authors of (Mohajerin Esfahani and Kuhn, 2018; Zhao and Guan, 2018; Gao and Kleywegt, 2016; Blanchet et al., 2019), and applied in a large body of literature. Many existing approaches require either assumptions such as quadratic loss functions or the knowledge of the Lipschitz constant or RKHS norm of the loss l, which are often hard to obtain in practice; see (Virmaux and Scaman, 2018).

(… 2007, Section 1.2). The reproducing property allows one to easily compute the expectation of any function f ∈ H, since E_{x∼P}[f(x)] = ⟨f, µ_P⟩_H. Embedding distributions into H also allows one to measure the distance between distributions in H. If k is universal, then the mean map P ↦ µ_P is injective on P (Gretton et al., 2012). With a universal H, given two distributions P, Q, ‖µ_P − µ_Q‖_H defines a metric. This quantity is known as the maximum mean discrepancy (MMD) (Gretton et al., 2012). With ‖f‖_H := √⟨f, f⟩_H and the reproducing property, it can be shown that

    ‖µ_P − µ_Q‖²_H = E_{x,x′∼P} k(x, x′) + E_{y,y′∼Q} k(y, y′) − 2 E_{x∼P, y∼Q} k(x, y),

allowing the plug-in estimator to be used for estimating the MMD from empirical data. The MMD is an instance of the class of integral probability metrics (IPMs), and can equivalently be written as ‖µ_P − µ_Q‖_H = sup_{‖f‖_H ≤ 1} ∫ f d(P − Q), where the optimum f* is a witness function (Gretton et al., 2012; Sriperumbudur et al., 2012).
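The closed-form expansion above makes the plug-in MMD estimator straightforward. Below is a minimal sketch of the (biased) plug-in estimate of the squared MMD with a Gaussian kernel; the kernel choice, bandwidth, and toy data are assumptions made for illustration rather than prescriptions from the paper.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) evaluated for all pairs of rows."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd_squared(X, Y, bandwidth=1.0):
    """Plug-in (biased) estimate of ||mu_P - mu_Q||_H^2 from samples X ~ P, Y ~ Q."""
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))   # samples from P
Y = rng.normal(0.5, 1.0, size=(200, 2))   # samples from a mean-shifted Q

print("MMD^2(P, Q)  :", mmd_squared(X, Y))                                    # clearly positive
print("MMD^2(P, P') :", mmd_squared(X, rng.normal(0.0, 1.0, size=(200, 2))))  # close to zero
```

Because the Gaussian kernel is universal on compact domains, the population MMD it induces is a proper metric between distributions, which is the property that underlies building ambiguity sets around the empirical embedding.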