
Uniform Convergence of Rank-weighted Learning

Justin Khim 1, Liu Leqi 1, Adarsh Prasad 1, Pradeep Ravikumar 1

1 Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA. Correspondence to: Justin Khim.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

The statistical decision-theoretic foundations of classical machine learning models have largely focused on estimating model parameters that minimize the expectation of a given loss function. However, as machine learning models are deployed in varied contexts, such as in high-stakes decision-making and societal settings, it is clear that these models are not evaluated solely by their average performance. In this work, we propose and study a novel notion of L-Risk based on the classical idea of rank-weighted learning. These L-Risks, induced by rank-dependent weighting functions, unify popular risk measures such as conditional value-at-risk and those defined by cumulative prospect theory. We give uniform convergence bounds for this broad class of risk measures and study their consequences on a logistic regression example.

1. Introduction

The statistical decision-theoretic foundations of modern machine learning have largely focused on solving tasks by minimizing the expectation of some loss function. This ensures that the resulting models have high average-case performance. However, as machine learning models are deployed alongside humans in decision-making, it is clear that they are not evaluated solely by their average-case performance but also by properties such as fairness. There has been increasing interest in capturing these additional properties via appropriate modifications of the classical objective of expected loss (Duchi et al., 2019; Garcia & Fernandez, 2015; Sra et al., 2012).

In parallel, there is a long line of work exploring alternatives to expected-loss-based risk measures in decision-making (Howard & Matheson, 1972) and in reinforcement learning, where percentile-based risk measures have been used to quantify the tail risk of models. A recent line of work borrows classical ideas from behavioral economics for use in machine learning to make models more human-aligned. In particular, Prashanth et al. (2016) have brought ideas from cumulative prospect theory (Tversky & Kahneman, 1992) into reinforcement learning and bandits, and Leqi et al. (2019) have used cumulative prospect theory to introduce the notion of human-aligned risk measures.

A common theme in these prior works is the notion of rank-weighted risks. The aforementioned risk measures weight each loss by its relative rank and are based upon classical rank-dependent expected utility theory (Diecidue & Wakker, 2001). These rank-dependent utilities have also been used in several other contexts in machine learning. For example, rank-weighted risks have been used to speed up the training of deep networks (Jiang et al., 2019). They have also played a key role in designing estimators that are robust to outliers in the data. In particular, trimming procedures that simply throw away data with high losses have been used to design estimators that are robust to outliers (Daniell, 1920; Bhatia et al., 2015; Lai et al., 2016; Prasad et al., 2018; Lugosi & Mendelson, 2019; Prasad et al., 2019; Shah et al., 2020).

While these rank-based objectives have found widespread use in machine learning, establishing their statistical properties has remained elusive. On the other hand, we have a clear understanding of the generalization properties of average-case risk measures. Loosely speaking, given a collection of models and finite training data, if we choose a model by minimizing the average training error, then we can roughly guarantee how well this chosen model performs in an average sense. Such guarantees are typically obtained by studying uniform convergence of average-case risk measures.

However, as noted before, uniform convergence of rank-weighted risk measures has not been explored in detail. The difficulty comes from the weights depending on the whole data set, thereby inducing complex dependencies. Hence, existing work on generalization has focused on the weaker notion of pointwise concentration bounds (Bhat, 2019; Duchi & Namkoong, 2018) or on specific forms of rank-weighted risk measures such as conditional value-at-risk (CVaR) (Duchi & Namkoong, 2018).

Contributions. In this work, we propose the study of rank-weighted risks and prove uniform convergence results. In particular, we propose a new notion of L-Risk in Section 2 that unifies existing rank-dependent risk measures, including CVaR and those defined by cumulative prospect theory. In Section 3, we present uniform convergence results, and we observe that the learning rate depends on the weighting function. In particular, when the weighting function is Lipschitz, we recover the standard O(n^{-1/2}) convergence rate. Finally, we instantiate our result on logistic regression in Section 4 and empirically study the convergence behavior of the L-Risks in Section 6.

2. Background and Problem Setup

In this section, we provide the necessary background on rank-weighted risk minimization and introduce the notion of bounded variation functions that we consider in this work. Additionally, we provide standard learning-theoretic definitions and notation before moving on to our main results.

2.1. L-Risk Estimation

We assume that there is a joint distribution P(X, Y) over the space 𝒵 = 𝒳 × 𝒴, and our goal is to learn a function f : 𝒳 → 𝒴. In the standard decision-theoretic framework, f* is chosen among a class of functions ℱ using a non-negative loss function ℓ : ℱ × 𝒵 → R₊.

Classical Risk Estimation. In the traditional setting of risk minimization, the population risk of a function f is given by the expectation of the loss function ℓ(f, Z) when Z is drawn according to the distribution P:
$$\mathcal{R}(f) = \mathbb{E}_{Z \sim P}[\ell(f, Z)].$$
Given n i.i.d. samples {Z_i}_{i=1}^n, empirical risk minimization substitutes the empirical risk for the population risk in the risk minimization objective:
$$\mathcal{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f, Z_i).$$

L-Risk. As noted in Section 1, there are many scenarios in machine learning where we want to evaluate a function by metrics other than the average loss. To this end, we first define the notion of an L-statistic, which dates back to the classical work of Daniell (1920).

Definition 1. Let X_(1) ≤ X_(2) ≤ ... ≤ X_(n) be the order statistics of the sample X_1, X_2, ..., X_n. Then, the L-statistic is a weighted average of order statistics,
$$T_w(\{X_i\}_{i=1}^n) = \frac{1}{n}\sum_{i=1}^{n} w\Big(\frac{i}{n}\Big)\, X_{(i)},$$
where w : [0, 1] → [0, ∞) is a weighting function.

With the above notion of an empirical L-statistic at hand, we define the empirical L-risk of any function f by simply replacing the empirical average of the losses with their corresponding L-statistic.

Definition 2. The empirical L-risk of f is
$$\mathcal{LR}_{w,n}(f) = \frac{1}{n}\sum_{i=1}^{n} w\Big(\frac{i}{n}\Big)\, \ell_{(i)}(f),$$
where ℓ_(1)(f) ≤ ... ≤ ℓ_(n)(f) are the order statistics of the sample losses ℓ(f, Z_1), ..., ℓ(f, Z_n).

Note that the empirical L-risk can alternatively be written in terms of the empirical cumulative distribution function F_{f,n} of the sample losses {ℓ(f, Z_i)}_{i=1}^n:
$$\mathcal{LR}_{w,n}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f, Z_i)\, w\big(F_{f,n}(\ell(f, Z_i))\big). \qquad (1)$$
Accordingly, the population L-risk of any function f can be defined as
$$\mathcal{LR}_{w}(f) = \mathbb{E}_{Z \sim P}\big[\ell(f, Z)\, w\big(F_{f}(\ell(f, Z))\big)\big], \qquad (2)$$
where F_f(·) is the cumulative distribution function of ℓ(f, Z) for Z drawn from P.

2.2. Illustrative Examples of L-Risk

The framework of risk minimization is a central paradigm of statistical estimation and is widely applicable. In this section, we provide illustrative examples showing that the L-risk generalizes the classical risk and encompasses several other notions of risk measures. To begin with, observe that by simply setting the weighting function to w(t) = 1 for all t ∈ [0, 1], L-Risk minimization reduces to classical empirical risk minimization.

Conditional Value-at-Risk (CVaR). As noted in Section 1, in settings where low-probability events incur catastrophic losses, using the classical risk is inappropriate. Conditional value-at-risk was introduced to handle such tail events; it measures the expected loss conditioned on the event that the loss exceeds a certain threshold. Moreover, CVaR has several desirable properties as a risk measure; in particular, it is convex and coherent (Krokhmal et al., 2013). Hence, CVaR is studied across a number of fields such as mathematical finance (Rockafellar et al., 2000), decision making, and more recently machine learning (Duchi et al., 2019). Formally, the CVaR of a function f at confidence level 1 − α is defined as
$$\mathcal{R}_{\mathrm{CVaR},\alpha}(f) = \mathbb{E}_{Z \sim P}\big[\ell(f, Z) \,\big|\, \ell(f, Z) \ge \mathrm{VaR}_\alpha(\ell(f, Z))\big], \qquad (3)$$
where $\mathrm{VaR}_\alpha(\ell(f, Z)) = \inf_{x : F_f(x) \ge 1 - \alpha} x$ is the value-at-risk. Observe that CVaR is a special case of the L-Risk in (2) and can be obtained by choosing w(t) = (1/α) 1{t ≥ 1 − α}, where 1{·} is the indicator function.
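To make Definitions 1 and 2 concrete, here is a minimal computational sketch (our illustration, not from the paper, assuming NumPy): it computes the empirical L-risk of equation (1) by sorting the sample losses and applying a rank-dependent weight, shown for the classical weight w(t) = 1 and the CVaR weight. The sample losses and the choice α = 0.1 are arbitrary.

```python
import numpy as np

def empirical_l_risk(losses, w):
    """Empirical L-risk of Definition 2: (1/n) * sum_i w(i/n) * loss_(i),
    where loss_(1) <= ... <= loss_(n) are the sorted sample losses."""
    losses = np.sort(np.asarray(losses, dtype=float))
    n = losses.size
    ranks = np.arange(1, n + 1) / n          # i/n, i.e. the empirical CDF value at loss_(i)
    return np.mean(w(ranks) * losses)

# Two weighting functions from Section 2.2 (alpha is an illustrative choice).
w_mean = lambda t: np.ones_like(t)                       # classical risk, w(t) = 1
w_cvar = lambda t, alpha=0.1: (t >= 1 - alpha) / alpha   # CVaR at confidence level 1 - alpha

rng = np.random.default_rng(0)
sample_losses = rng.exponential(size=1000)               # stand-in losses for illustration
print("classical risk:", empirical_l_risk(sample_losses, w_mean))
print("CVaR (alpha = 0.1):", empirical_l_risk(sample_losses, w_cvar))
```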

Human-Aligned Risk (HRM). Cumulative prospect theory (CPT), which is motivated by empirical studies of human decision-making from behavioral economics (Tversky & Kahneman, 1992), has recently been studied in machine learning (Prashanth et al., 2016; Gopalan et al., 2017; Leqi et al., 2019). In particular, Leqi et al. (2019) proposed the following human-aligned risk objective:
$$\mathcal{R}_{\mathrm{HRM},a,b}(f) = \mathbb{E}_{Z \sim P}\big[\ell(f, Z)\, w_{a,b}\big(F_f(\ell(f, Z))\big)\big],$$
where $w_{a,b}(t) = \frac{3 - 3b}{a^2 - a + 1}\big(3t^2 - 2(a+1)t + a\big) + 1$.

Trimmed Risk. The trimmed mean is a measure of the central tendency of a distribution and is calculated by discarding samples that lie above and below certain thresholds and using the remaining samples to compute the sample mean. It is known to be more robust to outliers and heavy tails and is widely used across a variety of disciplines, such as finance and the aggregation of scores in sports. The trimmed risk of a function f at trimming level α can be defined as
$$\mathcal{R}_{\mathrm{TRIM},\alpha}(f) = \mathbb{E}_{Z \sim P}\big[\ell(f, Z) \,\big|\, F_f(\ell(f, Z)) \in [\alpha, 1 - \alpha]\big].$$
The trimmed risk is also a special case of the L-Risk in (2) and is obtained by setting $w(t) = \frac{1}{1 - 2\alpha}\, 1\{\alpha \le t \le 1 - \alpha\}$.

2.3. Bounded Variation Weighting Functions

Recall from the previous section that the L-risk of any function f depends crucially on the weighting function w(·). Moreover, for popular risk measures such as CVaR and the trimmed risk, this weighting function is neither differentiable nor Lipschitz. In this section, we formally introduce a class of weighting functions called bounded variation functions, which can be viewed as a strict generalization of Lipschitz functions (Carothers, 2000; Musielak & Orlicz, 1959). More formally, we define the p-variation as follows.

Definition 3. Let f : [0, 1] → R be a function. Let P = {x_1, ..., x_N} ⊂ [0, 1] be a partition of [0, 1]. Without loss of generality, we assume x_1 ≤ ... ≤ x_N. Let 𝒫 be the set of all partitions of [0, 1], where the size of the partitions may vary between elements of 𝒫. The p-variation of f with respect to a partition P is
$$V_p(f, P) = \Bigg(\sum_{i=1}^{N-1} |f(x_{i+1}) - f(x_i)|^p\Bigg)^{1/p}.$$
The p-variation of f is $V_p(f) = \sup_{P \in \mathcal{P}} V_p(f, P)$.

When p = 1, the variation V_p(f) = V_1(f) is also called the total variation. Moreover, when the weighting function is λ-Lipschitz, then for all p ≥ 1 the p-variation is upper bounded by λ.

Table 1 summarizes the bounded variation constants for the aforementioned weighting functions. The proofs of these claims can be found in the Appendix. Moving forward, we work under the assumption that the weighting function w(·) has bounded variation.

Assumption 1. The weighting function w has bounded variation V_p(w) for p = 1, 2.

Stability to ℓ∞ Perturbations. Under Assumption 1, we next present a deterministic bound which controls the stability of bounded-variation functions to ℓ∞ perturbations.

Lemma 1. Let P* = {i/n}_{i=1}^n be the n-sized equally spaced partition of the interval [0, 1]. Then, for any perturbed partition P̃ = {i/n + Δ_i}_{i=1}^n with sup_i |Δ_i| = ‖Δ‖∞, we have
$$\frac{1}{n}\sum_{i=1}^{n} \Big| w\Big(\frac{i}{n}\Big) - w\Big(\frac{i}{n} + \Delta_i\Big) \Big| \le \frac{\lceil 2n\|\Delta\|_\infty \rceil\, V_1(w)}{n}.$$

Proof Sketch. The key idea is to construct sufficiently spaced-out partitions 𝒫 so that for every i ∈ [n] there exists a partition P ∈ 𝒫 such that both i/n and i/n + Δ_i are in P and no points of P lie between i/n and i/n + Δ_i. Since for all i ∈ [n] we know that i/n + Δ_i ∈ (i/n − ‖Δ‖∞, i/n + ‖Δ‖∞), it suffices to have partitions that are 2‖Δ‖∞ spaced out. This implies that the total number of partitions we need is ⌈2‖Δ‖∞ / (1/n)⌉ = ⌈2n‖Δ‖∞⌉. Since w has bounded 1-variation V_1(w), each partition contributes at most V_1(w), and we reach the desired result. Figure 1 shows the necessity of having spaced-out partitions.

Figure 1. An illustration of the partition argument. Here, we have Δ_i = ε_i and ‖Δ‖∞ = ε. The blue points, {(i+1)/n, (i+1)/n + ε_{i+1}}, and the red points, {i/n, i/n + ε_i, (i+2)/n}, are in two separate partitions. We construct the partitions this way to ensure that ε_{i+2} and (i+2)/n are not in between i/n and i/n + ε_i, while the order of i/n, ε_i, (i+1)/n, and ε_{i+1} does not matter because i/n and (i+1)/n are in separate partitions.

The above result is a key tool for studying uniform convergence of L-Risks with weighting functions w that have bounded variation. To the best of our knowledge it is novel and may be of independent interest.
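As a small numerical companion to Definition 3 (an illustration we add here, not part of the paper), the sketch below approximates V_p(w) by evaluating w on a fine uniform grid of [0, 1] and using that grid as the partition. Any fixed partition only lower-bounds the supremum in Definition 3, but for the step-like CVaR and trimmed-risk weights the grid captures the jumps and recovers the variation up to grid resolution.

```python
import numpy as np

def p_variation_on_grid(w, p=1.0, grid_size=10_001):
    """Approximate V_p(w) using a fine uniform grid of [0, 1] as the partition.
    This is a lower bound on the true p-variation (the sup over all partitions);
    it is accurate here because the weights below are piecewise constant."""
    t = np.linspace(0.0, 1.0, grid_size)
    diffs = np.abs(np.diff(w(t)))
    return np.sum(diffs ** p) ** (1.0 / p)

alpha = 0.1  # illustrative trimming / confidence level
w_cvar = lambda t: (t >= 1 - alpha) / alpha
w_trim = lambda t: ((t >= alpha) & (t <= 1 - alpha)) / (1 - 2 * alpha)

print("CVaR weight:    V_1 =", p_variation_on_grid(w_cvar, 1), " V_2 =", p_variation_on_grid(w_cvar, 2))
print("Trimmed weight: V_1 =", p_variation_on_grid(w_trim, 1), " V_2 =", p_variation_on_grid(w_trim, 2))
```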

Table 1. Bounded variation for different weight functions.

L-Risk       | w(t)                                                | V_1(w)                        | V_2(w)
R            | 1                                                   | 0                             | 0
R_CVaR,α     | (1/α) 1{1 − α ≤ t}                                  | 1/α                           | 1/α
R_HRM,a,b    | (3 − 3b)/(a² − a + 1) · (3t² − 2(a+1)t + a) + 1     | 6(1 − b)(2 − a)/(a² − a + 1)  | 6(1 − b)(2 − a)/(a² − a + 1)
R_TRIM,α     | 1/(1 − 2α) · 1{α ≤ t ≤ 1 − α}                       | 2/(1 − 2α)                    | √2/(1 − 2α)

2.4. Rademacher Complexity, Covering Numbers, and VC Dimension

Finally, we discuss the notions of function class complexity that we use in our results: Rademacher complexity, covering numbers, and VC dimension. The reason for using all three is that the Rademacher complexity is often an intermediate step but is difficult to analyze directly. Covering number bounds can be used to help untangle the product structure of the L-Risks, but ultimately, to deal with an empirical cumulative distribution function, it is simpler to use the VC dimension.

We start with Rademacher complexity. Let σ_1, ..., σ_n be i.i.d. Rademacher random variables, i.e., random variables that take the values +1 and −1 each with probability 1/2. The empirical Rademacher complexity of a function class ℱ given a sample S = Z_1, ..., Z_n is
$$\widehat{\mathfrak{R}}_n(\mathcal{F}) = \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(Z_i). \qquad (4)$$
The Rademacher complexity of ℱ is then the expectation of the empirical Rademacher complexity with respect to the sample S, i.e., $\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_S \widehat{\mathfrak{R}}_n(\mathcal{F})$.

Given a class of functions ℱ : 𝒳 → R, a finite collection of functions f_1, ..., f_N mapping from 𝒳 to R is called an ε-cover of ℱ with respect to a semi-norm ‖·‖ if for every f in ℱ we have min_{j=1,...,N} ‖f − f_j‖ ≤ ε. The ε-covering number of ℱ with respect to ‖·‖ is the size of the smallest ε-cover of ℱ and is denoted by 𝒩(ε, ℱ, ‖·‖). Note that bounds are often stated in terms of log 𝒩(ε, ℱ, ‖·‖), which is also called the metric entropy of ℱ. One particular norm of interest is the empirical 2-norm. Given n samples X_1, ..., X_n, define the empirical 2-norm of f to be
$$\|f\|_n = \sqrt{\frac{1}{n}\sum_{i=1}^{n} f(X_i)^2}.$$

Finally, we discuss the VC dimension. Let G be a class of functions from 𝒳 to some finite set, e.g., {0, 1}. We define the growth function Π_G : N → N by
$$\Pi_G(n) = \max_{x_1, \ldots, x_n \in \mathcal{X}} \big|\{(g(x_1), \ldots, g(x_n)) : g \in G\}\big|.$$
In words, this is the maximum number of ways that functions in G may classify n points. Now, suppose that functions in G map to a set of two classes, such as {0, 1}. Then, a set S = (x_1, ..., x_n) of n points in 𝒳 is said to be shattered by G if there are functions in G realizing all possible label assignments, i.e.,
$$\big|\{(g(x_1), \ldots, g(x_n)) : g \in G\}\big| = 2^n.$$
Finally, the VC dimension of G, which we denote by VC(G), is given by
$$\mathrm{VC}(G) = \max\{n : \Pi_G(n) = 2^n\}. \qquad (5)$$
In words, if the VC dimension of G is V, then there is some set of V points shattered by G. The set that we shall be most interested in using with the VC dimension is
$$\mathcal{L} := \{g_{f,t} : \mathcal{Z} \to \{0, 1\} \mid g_{f,t}(z) = 1\{\ell(f, z) \le t\} \text{ for } f \in \mathcal{F},\ t \in [0, B_\ell]\}.$$
Thus, the VC dimension depends on both the function class ℱ and the loss function. We note that it is not too large for logistic regression (Lemma 5), and for linear regression in R^d with squared error loss, the VC dimension is upper bounded by a constant times d (Akama et al., 2010). This VC dimension plays a key role in the following assumption.

Assumption 2. Assume that
$$\sup_{f \in \mathcal{F}}\ \sup_{x \in [0, B_\ell]} |F_{f,n}(x) - F_f(x)| \le \varepsilon.$$

We make a few remarks about this assumption. First, it can be thought of as a Glivenko–Cantelli-type assumption, except that here we require uniformity over ℱ. Second, our main results relying on variation use this assumption in two places. Third, the assumption can be shown to hold with high probability for ε = O(n^{-1/2}) when ℒ has bounded VC dimension, via a standard symmetrization and VC-dimension upper bound argument.
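To illustrate the n^{-1/2} scaling in the preceding remark, here is a rough numerical aside (not from the paper, and only for a single fixed model rather than uniformly over ℱ, with a stand-in loss distribution): it compares the empirical CDF of n losses against a CDF estimated from a much larger sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def ecdf_deviation(n, n_pop=200_000):
    # Empirical CDF of n losses vs. a CDF estimated from a large "population" sample,
    # evaluated on a common grid; exponential losses are a stand-in distribution.
    pop = np.sort(rng.exponential(size=n_pop))
    sample = np.sort(rng.exponential(size=n))
    grid = np.linspace(0.0, np.quantile(pop, 0.999), 2_000)
    F_pop = np.searchsorted(pop, grid, side="right") / n_pop
    F_n = np.searchsorted(sample, grid, side="right") / n
    return np.max(np.abs(F_n - F_pop))

for n in [100, 400, 1600, 6400]:
    devs = [ecdf_deviation(n) for _ in range(20)]
    # sqrt(n) * deviation should stay roughly constant if the deviation scales as n^{-1/2}
    print(n, round(float(np.mean(devs)), 4), round(float(np.mean(devs)) * np.sqrt(n), 3))
```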

We state this sufficient condition as an alternative assumption.

Assumption 2′ (sufficient condition). The function class ℒ has bounded VC dimension.

Consequently, it is natural to wonder whether this can be relaxed into a statement about the VC dimension of ℱ. Unfortunately, such a result depends on the loss function, so a general result is unknown to the best of our knowledge. However, we prove it to be the case for logistic regression in Lemma 5, and the proof extends to other widely used continuous classification losses. For linear regression with squared error loss, the VC dimension is bounded by a constant times d (Akama et al., 2010). We speculate that VC(ℒ) is on the order of VC(ℱ) for reasonable continuous losses.

3. Uniform Convergence Results

In this section, we present our main generalization result in terms of a Rademacher complexity depending on w. Then, we specialize the upper bound using an entropy argument in two cases: (a) w with bounded variation and (b) w that is Lipschitz. The former result allows for far more general w, but our proofs lead to slower rates. At best, a rate of O(n^{-1/4}) can be achieved by instantiating our bounds, although a more refined argument could possibly improve this. The latter result for Lipschitz w allows for the usual O(n^{-1/2}) learning rate.

We use F_ℱ to denote the set {F_f}_{f∈ℱ} and define ℓ(ℱ)·w(F_ℱ) to be the set {ℓ_f w(F_f) | f ∈ ℱ}. We have the following bound.

Theorem 1. Let ℱ be a set of predictors and let the loss function ℓ take values in [0, B_ℓ]. Under Assumption 1, there is a universal constant C such that with probability at least 1 − δ, we have
$$\sup_{f \in \mathcal{F}} |\mathcal{LR}_{w,n}(f) - \mathcal{LR}_w(f)| \le 2\widehat{\mathfrak{R}}_n(\ell(\mathcal{F}) \cdot w(F_{\mathcal{F}})) + 4 C B_\ell V_1(w)\, \widehat{\mathfrak{R}}_n(\mathcal{L}) + 6 B_\ell V_1(w) \sqrt{\frac{\log\frac{4}{\delta}}{2n}} + 3\sqrt{\frac{\log\frac{4}{\delta}}{2n}}.$$

In contrast to standard generalization bounds, our result has the terms R̂_n(ℒ) and R̂_n(ℓ(ℱ)·w(F_ℱ)). Since the former is a Rademacher complexity of indicator variables, we simply use a VC-dimension upper bound. The VC dimension then needs to be analyzed for particular losses and ℱ. Examples of losses and ℱ that permit finite VC dimension include linear regression with squared error loss and arbitrary ℱ with the logistic loss. To analyze the latter empirical Rademacher complexity, we use the standard Dudley entropy integral in order to obtain covering numbers. From there, we can more easily decompose the covering number into a product of covering numbers.

Lemma 2. Suppose that the loss function ℓ is λ_ℓ-Lipschitz in f(X) and bounded by B_ℓ. Additionally, assume that w is bounded by B_w. Then, we have
$$\mathcal{N}(t, \ell(\mathcal{F}) \cdot w(F_{\mathcal{F}}), \|\cdot\|_n) \le \mathcal{N}\Big(\frac{t}{2 B_w \lambda_\ell}, \mathcal{F}, \|\cdot\|_n\Big) \cdot \mathcal{N}\Big(\frac{t}{2 B_\ell}, w(F_{\mathcal{F}}), \|\cdot\|_n\Big).$$

Now, we use separate tools to analyze the covering number of w(F_ℱ) for w of bounded variation and for Lipschitz w.

3.1. Weight Functions of Bounded Variation

In this section, we consider w of bounded 2-variation V_2(w); we have the following lemma.

Lemma 3. Let 𝒞(ε′, F_ℱ) be an ε′-cover of F_ℱ in ‖·‖∞. Suppose that Assumption 2 holds with ε from the statement of the assumption. Then the set w(𝒞(ε′, F_ℱ)) is a $V_2(w)\sqrt{3(\varepsilon + \varepsilon')}$-cover of w(F_ℱ) in ‖·‖_n.

Lemma 3 is one of our key technical and conceptual contributions. The key point is that we can bound a covering number even though w may be discontinuous; this relies on constructing a number of partitions that may be upper bounded via the total variation, as mentioned previously. The shortcoming of this result is also clear: since ε from Assumption 2 is of order n^{-1/2}, the cover is only at resolution ε^{1/2} = n^{-1/4}.

Corollary 1. Let ℱ be a set of predictors, and let the loss function ℓ take values in [0, B_ℓ] and be λ_ℓ-Lipschitz in f(X). If w is bounded by B_w and the VC dimension V of ℒ satisfies 1 ≤ V ≤ n, then under Assumption 1 there exists a universal constant C such that with probability at least 1 − δ, we have
$$\sup_{f \in \mathcal{F}} |\mathcal{LR}_{w,n}(f) - \mathcal{LR}_w(f)| \le \inf_{\eta \ge 0}\Bigg\{ 8\eta + \frac{24}{\sqrt{n}} \int_{\eta}^{2 B_\ell B_w} \bigg( \log \mathcal{N}\Big(\frac{t}{2 B_w \lambda_\ell}, \mathcal{F}, \|\cdot\|_n\Big) + \log \mathcal{N}\Big(\frac{t^2}{12 B_\ell^2 V_2^2(w)} - \varepsilon, F_{\mathcal{F}}, \|\cdot\|_\infty\Big) \bigg)^{1/2} dt \Bigg\} + 4 C B_\ell V_1(w)\sqrt{\frac{V}{n}} + 6 B_\ell V_1(w)\sqrt{\frac{\log\frac{4}{\delta}}{2n}} + 3\sqrt{\frac{\log\frac{4}{\delta}}{2n}},$$
where $\varepsilon = 2C\sqrt{\frac{V}{n}} + 3\sqrt{\frac{\log\frac{4}{\delta}}{2n}}$.
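As a purely illustrative aside (not the paper's code), the quantity R̂_n(ℒ) appearing in Theorem 1 can be approximated by Monte Carlo over random sign vectors once ℱ is replaced by a small finite grid. The sketch below does this for one-dimensional logistic regression with a hypothetical grid of parameters and thresholds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of the empirical Rademacher complexity of the thresholded-loss
# class L = { z -> 1{ loss(theta, z) <= t } } over a finite grid of (theta, t).
n, n_mc = 200, 200
X = rng.uniform(-1.0, 1.0, size=n)                   # 1-d features in the unit ball [-1, 1]
p = (1.0 + X) / 2.0                                  # P(Y = +1 | X) with theta* = 1
Y = np.where(rng.random(n) < p, 1.0, -1.0)

thetas = np.linspace(-1.5, 1.5, 61)                  # grid over the parameter space
ts = np.linspace(0.0, 2.0, 41)                       # grid of thresholds t (arbitrary range)

losses = np.log1p(np.exp(-Y[None, :] * thetas[:, None] * X[None, :]))   # shape (61, n)
indicators = (losses[:, None, :] <= ts[None, :, None]).astype(float)    # shape (61, 41, n)

sup_vals = []
for _ in range(n_mc):
    sigma = rng.choice([-1.0, 1.0], size=n)
    sup_vals.append((indicators * sigma).mean(axis=-1).max())           # sup over the grid
print("Monte Carlo estimate of hat{R}_n(L):", float(np.mean(sup_vals)))
```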

3.2. Lipschitz Weight Functions

In this section, we consider the far simpler case when w is λ_w-Lipschitz. Again, we start with a bound on covering numbers.

Lemma 4. When w is λ_w-Lipschitz, we have the covering number bound
$$\mathcal{N}(t, w(F_{\mathcal{F}}), \|\cdot\|_n) \le \mathcal{N}\Big(\frac{t}{\lambda_w}, F_{\mathcal{F}}, \|\cdot\|_\infty\Big).$$

This leads naturally to a uniform convergence bound.

Corollary 2. Let ℱ be a set of predictors, and let the loss function ℓ take values in [0, B_ℓ] and be λ_ℓ-Lipschitz in f(X). If w is bounded by B_w and is λ_w-Lipschitz, and the VC dimension V of ℒ satisfies 1 ≤ V ≤ n, then there exists a universal constant C such that with probability at least 1 − δ, we have
$$\sup_{f \in \mathcal{F}} |\mathcal{LR}_{w,n}(f) - \mathcal{LR}_w(f)| \le \inf_{\eta \ge 0}\Bigg\{ 8\eta + \frac{24}{\sqrt{n}} \int_{\eta}^{2 B_\ell B_w} \bigg( \log \mathcal{N}\Big(\frac{t}{2 B_w \lambda_\ell}, \mathcal{F}, \|\cdot\|_n\Big) + \log \mathcal{N}\Big(\frac{t}{2 B_\ell \lambda_w}, F_{\mathcal{F}}, \|\cdot\|_\infty\Big) \bigg)^{1/2} dt \Bigg\} + 4 C B_\ell V_1(w)\sqrt{\frac{V}{n}} + 6 B_\ell V_1(w)\sqrt{\frac{\log\frac{4}{\delta}}{2n}} + 3\sqrt{\frac{\log\frac{4}{\delta}}{2n}}.$$

Note the key difference between Corollary 1 and Corollary 2. The presence of t² − ε in the former ensures that η needs to be non-zero and, further, that η ≥ √ε. This leads to the slow rate of convergence. In the latter corollary this is not a problem; thus, we obtain the standard rate of convergence.

4. Logistic Regression Example

We consider a basic example for logistic regression. To instantiate the bounds, we need two things: (1) a bound on the VC dimension of ℒ and (2) a covering number bound on F_ℱ, the set of distributions of losses over the predictor class ℱ.

Thus, we specify a distribution for (X, Y). Let S^{d−1} = {x ∈ R^d : ‖x‖ = 1} denote the (d−1)-dimensional sphere in R^d, and let B(d) = {x ∈ R^d : ‖x‖ ≤ 1} denote the unit ball of radius 1 in R^d. Let θ_* ∼ Uniform(S^{d−1}) be the true regression vector. Suppose that X ∼ Uniform(B(d)) and that Y takes the value +1 with probability p = (1 + θ_*^⊤ X)/2 and −1 otherwise.

Next, we restrict the class of regressors ℱ. We set
$$\mathcal{F} = \{f : f(x) = \theta^\top x \text{ for } \theta \in \mathbb{R}^d \text{ s.t. } r_1 \le \|\theta\| \le r_2\}, \qquad (6)$$
where r_1, r_2 > 0 are fixed constants. We may refer directly to the parametrization θ for simplicity. Note that bounding θ away from the zero vector is necessary: as θ approaches the zero vector, the distribution of losses approaches a step function, making a cover impossible. Finally, recall that the logistic loss is ℓ(f, Z) = log(1 + e^{−Y f(x)}). Now, we start with our first lemma on the VC dimension.

Lemma 5. Let ℱ be the class of linear functions f(x) = θ^⊤ x + b for x in R^d. Then, the VC dimension of ℒ for the logistic loss satisfies
$$\mathrm{VC}(\mathcal{L}) \le d + 2.$$

This type of bound is not particularly surprising, and other bounds on the VC dimension of ℒ in terms of the VC dimension of ℱ are known for simple losses. Next, we consider the bound on the covering numbers of the losses.

Lemma 6. Let (X, Y) have the distribution defined above, and let ℱ be as in equation (6). Then, we have the ε-entropy bound
$$\log \mathcal{N}(\varepsilon, F_{\mathcal{F}}, \|\cdot\|_\infty) \le \frac{C(\mathcal{F})\sqrt{d}}{\varepsilon}.$$

To the best of our knowledge, such bounds on sets of cumulative distribution functions are novel. Finally, we may apply these lemmas to obtain the following corollaries for weight functions of bounded variation and Lipschitz weight functions. We start with the former.

Corollary 3. Let (X, Y) have the distribution defined above. Let ℱ be as defined in equation (6). Suppose that the logistic loss is bounded by B_ℓ on B(d) × {+1, −1} and is λ_ℓ-Lipschitz with respect to f(x) for all f in ℱ. Suppose that w has bounded first and second variations V_1(w) and V_2(w). Then with probability at least 1 − δ, we have
$$\sup_{f \in \mathcal{F}} |\mathcal{LR}_{w,n}(f) - \mathcal{LR}_w(f)| \le C(\mathcal{F}, d, \delta, B_\ell, B_w, V_1(w), V_2(w))\, n^{-1/4}.$$

Next, we consider Lipschitz weight functions.

Corollary 4. Let (X, Y) have the distribution defined above. Let ℱ be as defined in equation (6). Suppose that the logistic loss is bounded by B_ℓ on B(d) × {+1, −1} and is λ_ℓ-Lipschitz with respect to f(x) for all f in ℱ. Finally, suppose that w is λ_w-Lipschitz, and let C be some universal constant. Then with probability at least 1 − δ, we have
$$\sup_{f \in \mathcal{F}} |\mathcal{LR}_{w,n}(f) - \mathcal{LR}_w(f)| \le C(\mathcal{F}, d, \delta, B_\ell, B_w, \lambda_\ell)\, n^{-1/2}.$$

Here, we obtain the desired rates of convergence, showing that the generalization bounds can be instantiated and lead to quantifiable learning rates.
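For readers who want to experiment with this example, the following is a small sampling sketch of the distribution just described (our illustration under the stated assumptions, not code from the paper); the dimension d = 5 and the particular choices of θ_* and θ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_logistic_example(n, d, theta_star):
    """Sample (X, Y) from the Section 4 distribution: X ~ Uniform(B(d)) and
    Y = +1 with probability (1 + theta_star^T x) / 2, Y = -1 otherwise."""
    g = rng.normal(size=(n, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(n) ** (1.0 / d)          # radius with density proportional to r^{d-1}
    X = directions * radii[:, None]
    p = (1.0 + X @ theta_star) / 2.0
    Y = np.where(rng.random(n) < p, 1.0, -1.0)
    return X, Y

d = 5
theta_star = np.zeros(d); theta_star[0] = 1.0    # a unit-norm "true" parameter
X, Y = sample_logistic_example(1000, d, theta_star)
theta = 0.5 * theta_star                          # a predictor inside the constraint set for suitable r1, r2
losses = np.log1p(np.exp(-Y * (X @ theta)))       # logistic losses, bounded on B(d) x {+1, -1}
print(losses.mean(), losses.max())
```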

5. Proofs

In this section, we prove our main theorem (Theorem 1). The lemmas are proved in the Appendix. The first key step is to choose the correct decomposition of the main estimator. This step is necessary because we cannot directly apply standard generalization results to obtain the bound for rank-weighted estimators: the terms {F_{f,n}(ℓ_{f,i})}_{i=1}^n use the samples {ℓ(f, Z_i)}_{i=1}^n twice, and so there is dependence across samples. Our goal is to use symmetrization with the main error term.

Lemma 7. We have the inequality
$$\mathbb{P}\Big(\sup_{f \in \mathcal{F}} |\mathcal{LR}_{w,n}(f) - \mathcal{LR}_w(f)| > t + t'\Big) \le \mathbb{P}\Big(\sup_{f \in \mathcal{F}} \Big|\mathcal{LR}_{w,n}(f) - \frac{1}{n}\sum_{i=1}^{n} \ell_{f,i}\, w(F_f(\ell_{f,i}))\Big| > t\Big) + \mathbb{P}\Big(\sup_{f \in \mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^{n} \ell_{f,i}\, w(F_f(\ell_{f,i})) - \mathcal{LR}_w(f)\Big| > t'\Big).$$

The proof of the lemma is immediate, but it sets the stage for analyzing the risk. Next, we present a lemma to deal with the first term on the right-hand side of Lemma 7.

Lemma 8. For some universal constant C, we have
$$\mathbb{P}\Bigg(\sup_{f \in \mathcal{F}} \Big|\mathcal{LR}_{w,n}(f) - \frac{1}{n}\sum_{i=1}^{n} \ell_{f,i}\, w(F_f(\ell_{f,i}))\Big| > 4 C B_\ell V_1(w)\, \widehat{\mathfrak{R}}_n(\mathcal{L}) + 6 B_\ell V_1(w)\sqrt{\frac{\log\frac{2}{\delta}}{2n}}\Bigg) \le \delta.$$

The proof of this lemma is given in the Appendix due to space constraints. However, note that this is the second lemma whose proof makes use of partitions together with the first bounded variation V_1(w). To deal with the second term of Lemma 7, we use a standard Rademacher complexity bound, provided in the Appendix, which yields
$$\mathbb{P}\Bigg(\sup_{f \in \mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^{n} \ell_{f,i}\, w(F_f(\ell_{f,i})) - \mathcal{LR}_w(f)\Big| > 2\widehat{\mathfrak{R}}_n(\ell(\mathcal{F}) \cdot w(F_{\mathcal{F}})) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}\Bigg) \le \delta.$$

6. Experiments

Following our theoretical analysis, we test our results on logistic regression and linear regression. We describe our setups in Section 6.1 and the metrics as well as the optimization methods we use in Section 6.2, and we discuss our results in Section 6.3.

6.1. Setups

In the logistic regression setup, the features are drawn from X ∼ Uniform(B(d)), where B(d) is the d-dimensional unit ball. The labels are sampled from a distribution over {−1, 1} where Y takes the value +1 with probability (1 + X^⊤θ_*)/2, with θ_* = (1, 0, 0, ..., 0) ∈ R^d. We use the logistic loss ℓ(θ; (X, Y)) = log(1 + exp(−Y X^⊤θ)).

In the linear regression experiment, we draw our covariates from a Gaussian, X ∼ 𝒩(0, I_d) in R^d. The noise distribution is fixed as ε ∼ 𝒩(0, 0.01). We draw our response variable Y as Y = X^⊤θ_* + ε, where θ_* = (1, 1, ..., 1) ∈ R^d. We fix the squared error ℓ(θ; (X, Y)) = ½(Y − X^⊤θ)² as our loss function.

6.2. Metrics and Methods

For each setting, we look at two metrics: (a) the approximate uniform error and (b) the training and testing performance of the empirical minimizer of different L-Risks. We approximate the uniform error when d = 1 in the following manner: for logistic regression, we pick the interval (−1.5, 1.5) to be our parameter space Θ and construct a grid of size 200, where each grid point represents a model θ. The true L-Risk for each θ is then approximated by the empirical L-Risk using 20,000 samples. The empirical L-Risk is calculated using sample sizes n ranging from 20 to 22,100. Finally, the approximated uniform error at each sample size n is obtained by taking the maximum over the differences between the empirical and true L-Risks over all θ. In the linear regression setup, the parameter space is chosen to be (−15, 15) with a grid of size 500.

To explore the training and testing performance of the empirical minimizer of different L-Risks, we perform the following iterative optimization procedure to obtain a heuristic minimizer:
$$\theta^{t+1} = \theta^{t} - \frac{\gamma_t}{n}\sum_{i=1}^{n} w_i^t\, \nabla_\theta \ell(\theta^t; Z_i), \qquad (7)$$
for all t ∈ {0, ..., T − 1}, where $w_i^t = w\big(F_n(\ell(\theta^t; Z_i))\big)$ is the rank-dependent weight and γ_t is the learning rate. Although the empirical L-Risks are in general non-convex with respect to the model parameter θ, there are special cases like CVaR, which has a convex dual form that is easy to optimize (Rockafellar et al., 2000):
$$\mathcal{R}_{\mathrm{CVaR},\alpha}(f) = \inf_{\eta \in \mathbb{R}}\Big\{\eta + \frac{1}{\alpha}\mathbb{E}[\max(\ell_f - \eta, 0)]\Big\}. \qquad (8)$$
In such cases, we compare the heuristic method with the procedure that first optimizes the L-Risk with respect to η and then with respect to θ at each iteration.
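As a minimal, self-contained sketch of the heuristic update in equation (7) (again our illustration, not the paper's implementation), the snippet below runs the rank-weighted gradient procedure on the linear regression setup of Section 6.1 with the CVaR weight; the constant step size and iteration count are arbitrary choices rather than the settings used for the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-weighted gradient descent (equation (7)) on the linear regression setup of Section 6.1,
# using the CVaR weight w(t) = (1/alpha) 1{t >= 1 - alpha}.
d, n, alpha, gamma, T = 5, 2000, 0.1, 0.1, 500
theta_star = np.ones(d)
X = rng.normal(size=(n, d))
Y = X @ theta_star + rng.normal(scale=0.1, size=n)   # noise with variance 0.01

w = lambda t: (t >= 1 - alpha) / alpha

theta = np.zeros(d)
for _ in range(T):
    residuals = Y - X @ theta
    losses = 0.5 * residuals ** 2
    # empirical CDF value (rank / n) of each sample's loss under the current model
    ranks = np.empty(n)
    ranks[np.argsort(losses)] = np.arange(1, n + 1) / n
    weights = w(ranks)
    grad = -(weights * residuals) @ X / n            # (1/n) sum_i w_i^t grad loss_i(theta)
    theta = theta - gamma * grad

print("distance to theta_star:", float(np.linalg.norm(theta - theta_star)))
```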

(a) log(uniform error) vs. log n. (b) d = 5; n from 20 to 22,100. (c) n = 2000; d from 7 to 55.

Figure 2. Experimental results for logistic regression, averaged over 20 seeds. Figure 2a is the log-log plot of the approximated uniform error with respect to different sample sizes. Figures 2b and 2c are log-log plots of the performance of the empirical minimizer of different L-Risks with respect to sample size and sample dimension. For HRM, we have chosen a = 0.6 and b = 0.4. For CVaR and the trimmed mean, α is set to 0.1. The results for logistic regression suggest similar convergence behavior across the different L-Risks.

(a) log(uniform error) vs. log n. (b) d = 5; n from 20 to 22,100. (c) n = 2000; d from 7 to 55.

Figure 3. Experimental results for linear regression, averaged over 20 seeds. Figure 3a is the log-log plot of the approximated uniform error with respect to different sample sizes. Figures 3b and 3c are log-log plots of the performance of the empirical minimizer of different L-Risks with respect to sample size and sample dimension. For HRM, we have chosen a = 0.3 and b = 0.4. For CVaR and the trimmed mean, α is set to 0.1. The results for linear regression suggest similar convergence behavior across the different L-Risks.

6.3. Results

Our results for logistic regression are given in Figure 2, and our results for linear regression are given in Figure 3. Figure 2a and Figure 3a are the log-log plots of the approximated uniform error with respect to the sample size in the two experimental settings. We fit a line to the data for each L-Risk to examine the rate of convergence of these approximated uniform errors. In both the logistic regression and linear regression experiments, we see that rank-weighting does not have much effect on the convergence rate. In the logistic regression case, the n^{-1/2} convergence rate of HRM matches our expectation, since its weight function is Lipschitz. The convergence rate for CVaR and the trimmed mean suggests that there might be tighter uniform convergence results for weight functions with bounded variation.

Figures 2b, 2c, 3b, and 3c show the behavior of the L-Risk of the empirical minimizers. Empirically, we observe that the convergence performance across different L-Risks is similar. As with the log-log plots, this may suggest that there are faster rates of convergence than the ones we have obtained in Section 3, especially for the weighting functions that are not Lipschitz but have bounded variation.

Finally, we make a remark about the optimization of CVaR via its rank-weighted formulation versus its convex dual. As shown in the Appendix, we observe that when optimizing CVaR, our heuristic method given in equation (7) improves more quickly than the convex dual form of CVaR in equation (8). However, we note that the dual form achieves a marginally better solution in terms of training error. Practically, the comparable performance of the rank-weighted optimization perspective and the provably convex perspective, in the special case where we can compare the two, is encouraging for other weight functions. The code for the experiments can be found at https://bit.ly/2YzwRkJ.
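The rate estimate mentioned above is a simple least-squares fit on the log-log scale. The sketch below shows the idea on synthetic uniform-error values generated to decay like n^{-1/2}; the numbers are placeholders, not the paper's measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

ns = np.array([20, 80, 320, 1280, 5120, 20480])
# Synthetic stand-in for the approximated uniform errors of Section 6.2 (not real results):
uniform_errors = 2.0 * ns ** -0.5 * np.exp(0.05 * rng.standard_normal(ns.size))

slope, intercept = np.polyfit(np.log(ns), np.log(uniform_errors), deg=1)
print("estimated convergence exponent:", round(float(slope), 3))   # close to -0.5 by construction
```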

7. Discussion

In this work, we study the uniform convergence of rank-weighted risks, which we define as L-Risks. Different from classical population risks, the L-Risk utilizes the ranking of the losses. There are a number of future directions in the study of L-Risks. One direction, suggested by our experimental results in Section 6, is to obtain faster rates of convergence for L-Risks with weighting functions of bounded variation. In addition, it is of interest to obtain a lower bound on the convergence rate for L-Risks with non-Lipschitz weighting functions. Finally, since L-Risks are in general non-convex with respect to the model parameters, using them to learn machine learning models requires investigating algorithms for optimizing them. In particular, it would be useful to better understand simple optimization methods for arbitrary rank-weighting functions, such as the iterative approach we used in the experiments.

Acknowledgements

We acknowledge the support of NSF via IIS-1909816 and ONR via N000141812861.

References

Akama, Y., Irie, K., Kawamura, A., and Uwano, Y. VC dimensions of principal component analysis. Discrete & Computational Geometry, 44(3):589–598, 2010.

Bhat, S. P. Improved concentration bounds for conditional value-at-risk and cumulative prospect theory using Wasserstein distance. arXiv preprint arXiv:1902.10709, 2019.

Bhatia, K., Jain, P., and Kar, P. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pp. 721–729, 2015.

Carothers, N. L. Real Analysis. Cambridge University Press, 2000.

Daniell, P. Observations weighted according to order. American Journal of Mathematics, 42(4):222–236, 1920.

Diecidue, E. and Wakker, P. P. On the intuition of rank-dependent utility. Journal of Risk and Uncertainty, 23(3):281–298, 2001.

Duchi, J. and Namkoong, H. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.

Duchi, J. C., Hashimoto, T., and Namkoong, H. Distributionally robust losses against mixture covariate shifts. Preprint, 2019.

Garcia, J. and Fernandez, F. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

Gopalan, A., Prashanth, L., Fu, M., and Marcus, S. Weighted bandits or: How bandits learn distorted values that are not expected. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Howard, R. A. and Matheson, J. E. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.

Jiang, A. H., Wong, D. L.-K., Zhou, G., Andersen, D. G., Dean, J., Ganger, G. R., Joshi, G., Kaminsky, M., Kozuch, M., Lipton, Z. C., et al. Accelerating deep learning by focusing on the biggest losers. arXiv preprint arXiv:1910.00762, 2019.

Krokhmal, P., Zabarankin, M., and Uryasev, S. Modeling and optimization of risk. In Handbook of the Fundamentals of Financial Decision Making: Part II, pp. 555–600. World Scientific, 2013.

Lai, K. A., Rao, A. B., and Vempala, S. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pp. 665–674. IEEE, 2016.

Leqi, L., Prasad, A., and Ravikumar, P. On human-aligned risk minimization. In Advances in Neural Information Processing Systems, 2019.

Lugosi, G. and Mendelson, S. Robust multivariate mean estimation: the optimality of trimmed mean. arXiv preprint arXiv:1907.11391, 2019.

Musielak, J. and Orlicz, W. On generalized variations (I). Studia Mathematica, 18(1):11–41, 1959.

Prasad, A., Suggala, A. S., Balakrishnan, S., and Ravikumar, P. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

Prasad, A., Balakrishnan, S., and Ravikumar, P. A unified approach to robust mean estimation. arXiv preprint arXiv:1907.00927, 2019.

Prashanth, L., Jie, C., Fu, M., Marcus, S., and Szepesvari, C. Cumulative prospect theory meets reinforcement learning: Prediction and control. In International Conference on Machine Learning, pp. 1406–1415, 2016.

Rockafellar, R. T., Uryasev, S., et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.

Shah, V., Wu, X., and Sanghavi, S. Choosing the sample with lowest loss makes SGD robust. arXiv preprint arXiv:2001.03316, 2020.

Sra, S., Nowozin, S., and Wright, S. J. Optimization for Machine Learning. MIT Press, 2012.

Tversky, A. and Kahneman, D. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297–323, 1992.