Generalized Huber Loss for Robust Learning and Its Efficient Minimization for a Robust Statistics
Kaan Gokcesu, Hakan Gokcesu
arXiv:2108.12627v1 [stat.ML] 28 Aug 2021

Abstract—We propose a generalized formulation of the Huber loss. We show that with a suitable function of choice, specifically the log-exp transform, we can achieve a loss function which combines the desirable properties of both the absolute and the quadratic loss. We provide an algorithm to find the minimizer of such loss functions and show that finding a centralizing metric is not that much harder than the traditional mean and median.

I. INTRODUCTION

Many problems in the learning, optimization and statistics literature [1]–[4] require robustness, i.e., that a model trained (or optimized) be less influenced by some outliers than by inliers (i.e., the nominal data) [5], [6]. This approach is extremely common in parameter estimation and learning tasks, where a robust loss (e.g., the absolute error) may be more desirable than a nonrobust loss (e.g., the quadratic error) due to its insensitivity to large errors. Many penalties for robustness with their particular properties have been proposed in the literature [7], [8], including parametric formulations to achieve robustness [9]. In traditional learning approaches, like gradient descent and M-estimation [10], various losses are commonly used experimentally when designing a system. Normally, the loss metric is provided by the problem formulation itself. However, using a suitably designed loss function can be useful when the performance evaluation of the resulting learned model is hard to express mathematically. It can be useful in parameter estimation [11], [12] and prediction [13], [14] problems.

In general, when we have the freedom, instead of using some outlier detection [15]–[17] strategies, it has become important to design loss functions that are intrinsically resistant to outliers. In learning problems, two very commonly used functions are the squared (quadratic) loss, $L(x) = x^2$, and the absolute loss, $L(x) = |x|$. The underlying reasons are that the squared loss is strongly convex (hence, has a fast learning rate) and the absolute loss is robust. The squared loss has the disadvantage that it can be dominated by outliers, and when the underlying distribution of the nominal data is heavy-tailed, the efficiency of its minimizer (i.e., the mean) can be poor, i.e., it does not have sufficient distributional robustness [6]. Thus, the estimates may be heavily distorted by some extreme outliers compared to when the outliers are not present. However, the absolute loss does not have these problems and is robust against arbitrary outliers, since their contribution to the estimation is effectively determined by their ordinalities in the data, not their values. Nonetheless, since the quadratic loss is strongly convex, it has fast convergence and learning. Therefore, it is of utmost importance to combine the best of both worlds and create algorithms which are both robust against outliers and have fast convergence near negligible loss.

To create a robust loss with fast convergence, we need to combine the properties of the absolute and the quadratic loss. The most straightforward approach is to use a piecewise function that combines the quadratic and absolute losses where they work best. As an example, we can straightforwardly use the following function:

L_P(x) = \begin{cases} x^2, & |x| \leq 1 \\ |x|, & |x| > 1 \end{cases}.   (1)

While this function is continuous, the strict cutoff at 1 may prove to be arbitrary or even useless for certain datasets. We can bypass this problem by using a free variable $\delta$ as

L_{P,\delta}(x) = \begin{cases} \frac{1}{\delta} x^2, & |x| \leq \delta \\ |x|, & |x| > \delta \end{cases}.   (2)

While this version is somewhat useful, it is not differentiable and may prove to be difficult to use in learning tasks.

To solve the differentiability issue, one can modify the combination as follows, which gives the most popular approach to combining the quadratic and absolute loss functions (i.e., the Huber loss) [3]:

L_D(x) = \begin{cases} \frac{1}{2\delta} x^2 + \frac{1}{2}\delta, & |x| \leq \delta \\ |x|, & |x| > \delta \end{cases}.   (3)

Although this function is differentiable, it is not twice differentiable and thus not smooth.

For smoothness, many variants have been proposed. A popular one is the Pseudo-Huber loss [18]:

L_{Hp}(x) = \delta \sqrt{1 + \frac{x^2}{\delta^2}},   (4)

which is $\frac{1}{2\delta} x^2 + \delta$ near $0$ and $|x|$ at the asymptotes. While the above is the most common form, other smooth approximations of the Huber loss function also exist [19].
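To make the behavior of (2), (3) and (4) concrete, the following minimal sketch evaluates the three losses side by side (assuming NumPy; the threshold $\delta = 1$ and the sample points are arbitrary illustrative choices, not values prescribed by the paper).

```python
import numpy as np

def piecewise_loss(x, delta=1.0):
    # Eq. (2): x^2/delta inside the threshold, |x| outside.
    # Continuous, but not differentiable at |x| = delta.
    return np.where(np.abs(x) <= delta, x**2 / delta, np.abs(x))

def huber_loss(x, delta=1.0):
    # Eq. (3): quadratic core x^2/(2*delta) + delta/2, linear tails |x|.
    # Differentiable, but not twice differentiable at |x| = delta.
    return np.where(np.abs(x) <= delta, x**2 / (2 * delta) + delta / 2, np.abs(x))

def pseudo_huber_loss(x, delta=1.0):
    # Eq. (4): delta * sqrt(1 + x^2/delta^2), smooth everywhere;
    # roughly x^2/(2*delta) + delta near 0 and |x| at the asymptotes.
    return delta * np.sqrt(1.0 + (x / delta) ** 2)

x = np.linspace(-5.0, 5.0, 11)
for name, loss in [("piecewise", piecewise_loss),
                   ("huber", huber_loss),
                   ("pseudo-huber", pseudo_huber_loss)]:
    print(name, np.round(loss(x, delta=1.0), 3))
```

The piecewise and Huber forms match $|x|$ exactly at the threshold, while the Pseudo-Huber form only approaches $|x|$ asymptotically but stays smooth everywhere.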
All in all, the convention is to use either the Huber loss or some variant of it. To this end, we propose a formulation for what we call the Generalized-Huber loss, to encapsulate many different (possible) variants, together with an algorithmic solver. The organization of our paper is the following. In Section II, we provide the generalized formulation of the Huber loss. In Section III, we produce a strictly convex, smooth and robust loss from the generalized formulation. In Section IV, we design an algorithm which minimizes such loss functions. In Section V, we finish with further discussions and concluding remarks.

II. THE GENERALIZED HUBER LOSS

In this section, we introduce the Generalized-Huber loss. We first start with the definition of a general loss function.

Definition 1. Let $L(\cdot)$ be some loss function such that $L : \Re \to \Re$, where the minimum is at $x = 0$, i.e.,

\arg\min_x L(x) = 0, \qquad \min_x L(x) = L(0).

For a general loss function $L(\cdot)$, we have the following property.

Lemma 1. If $L(\cdot)$ and its first derivative $L'(\cdot)$ are continuous at $x = 0$ with positive second derivative (i.e., $L''(0) > 0$) and finite higher derivatives, $L(\cdot)$ has a quadratic behavior near $x = 0$.

Proof. Near $x = 0$, we have

L(x) = L(0) + L'(0) x + \frac{1}{2} L''(0) x^2 + o(x^2),   (5)

from Taylor's expansion. Since $L'(0) = 0$ because of continuity and the minimum at $0$, we have

L(x) \cong L(0) + \frac{1}{2} L''(0) x^2,   (6)

near $0$. Hence, it suffices for the loss function to have a positive second derivative and finite higher derivatives at zero for its convergence to a quadratic function.

This result shows that if a loss function and its first derivative are continuous at $x = 0$, it has a quadratic behavior for small error. Unfortunately, the absolute loss function does not have a continuous derivative at $x = 0$. To solve this, we smooth it with an isotonic/monotonic auxiliary function $f(\cdot)$ [20].

Definition 2. Let $f(\cdot)$ be some monotone increasing function such that

\lim_{x \to \infty} f(x) = \infty, \qquad \lim_{x \to -\infty} f(x) < \infty.

The auxiliary function $f(\cdot)$ diverges when $x \to \infty$ and converges when $x \to -\infty$. Using this auxiliary function, we create the smoothed absolute loss as the following.

Definition 3. Let the smoothed loss be

L_G(x) = g(f(x) + f(-x)),

where $g(\cdot)$ is the inverse of $f(\cdot)$, i.e., $f^{-1}(x)$.

We can see that for a smooth and differentiable auxiliary function $f(\cdot)$, this loss is also smooth and differentiable. Next, we study its behavior for small and large errors.

Lemma 2. $L_G(\cdot)$ converges to the absolute loss at the asymptotes, i.e.,

\lim_{|x| \to \infty} L_G(x) = |x|.

Proof. When $x$ goes to $\infty$, $f(-x)$ does not diverge, hence,

L_G(x) \to g(f(x)) = x \quad \text{as } x \to \infty,   (7)
L_G(x) \to g(f(-x)) = -x \quad \text{as } x \to -\infty,   (8)

which concludes the proof.

Lemma 3. $L_G(x)$ converges to the quadratic loss near $0$, i.e., $L_G(x) \to a x^2 + b$ as $|x| \to 0$, where $a = g'(2f(0)) f''(0)$ and $b = g(2f(0))$.

Proof. We have

L_G(x) = g(f_+(x)),   (9)
L_G'(x) = g'(f_+(x)) f_+'(x),   (10)
L_G''(x) = g''(f_+(x)) [f_+'(x)]^2 + g'(f_+(x)) f_+''(x),   (11)

where

f_+(x) \triangleq f(x) + f(-x),   (12)
f_+'(x) = f'(x) - f'(-x),   (13)
f_+''(x) = f''(x) + f''(-x).   (14)

Thus, at $x = 0$, we have

L_G(0) = g(2f(0)),   (15)
L_G'(0) = 0,   (16)
L_G''(0) = 2 g'(2f(0)) f''(0),   (17)

since $f_+(0) = 2f(0)$, $f_+'(0) = 0$ and $f_+''(0) = 2f''(0)$. Hence,

L_G(x) \to g'(2f(0)) f''(0) x^2 + g(2f(0)) \quad \text{as } |x| \to 0,   (18)

from (6), which concludes the proof.

Corollary 1. From Lemma 3, we have the following for Definition 2 and Definition 3:

L_G''(0) > 0 \iff f''(0) > 0.

Proof. Since $f(\cdot)$ is monotonically increasing, so is $g(\cdot)$ (i.e., $f^{-1}(x)$). Thus,

f'(x), g'(x) > 0 \quad \forall x,

i.e., both derivatives are strictly greater than $0$, which concludes the proof.
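As a numerical illustration of Definition 3 and the asymptotic and small-error behavior of Lemmas 2 and 3, the following sketch builds $L_G$ from a placeholder auxiliary function; the softplus below is only an illustrative choice that satisfies Definition 2 (it is increasing, diverges as $x \to \infty$ and stays bounded as $x \to -\infty$), not a function analyzed in the paper.

```python
import numpy as np

def generalized_huber(f, g):
    """Build the smoothed loss L_G(x) = g(f(x) + f(-x)) of Definition 3
    from an auxiliary function f and its inverse g."""
    return lambda x: g(f(x) + f(-x))

# Placeholder auxiliary function: the softplus f(x) = log(1 + e^x) is
# monotone increasing, diverges as x -> +inf and stays bounded as x -> -inf.
f = lambda x: np.log1p(np.exp(x))
g = lambda y: np.log(np.expm1(y))  # inverse of the softplus, valid for y > 0

L_G = generalized_huber(f, g)

x = np.array([-50.0, -5.0, -0.5, 0.0, 0.5, 5.0, 50.0])
print(np.round(L_G(x), 4))
# Large |x|: the values approach |x| (absolute-loss behavior, Lemma 2).
# Near 0: the loss is smooth and approximately quadratic (Lemma 3).
```

For this particular placeholder, $L_G(x)$ simplifies to $\log(e^{x} + e^{-x} + 1)$, which already hints at the log-exp construction studied in Section III.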
Example 1. We use a quadratic auxiliary function,

f(x) = U(x) \frac{x^2}{\delta^2} + 1,

where $U(x)$ is the step function. Note that this $f(\cdot)$ is not increasing everywhere and does not have a direct inverse. However, we can use the following pseudo-inverse:

g(x) = \delta \sqrt{x - 1}, \quad x \geq 1.

Thus, the loss function becomes

L(x) = g(f(x) + f(-x)) = \delta \sqrt{\frac{x^2}{\delta^2} + 1},

which is the Pseudo-Huber loss.

This generalized formulation does not guarantee convexity over the whole domain. For convexity to exist, specific functions need to be studied. In the next section, we study one such function.

III. A STRICTLY CONVEX SMOOTH ROBUST LOSS

For convexity, we study the exponential transform. Let

f(x) = e^{ax} + b,   (19)

for some $a > 0$, hence

g(x) = f^{-1}(x) = \frac{1}{a} \log(x - b), \quad x > b.   (20)

Using this transform in Definition 3, the resulting loss, which we denote by $L_M(\cdot)$, is $g(f(x) + f(-x)) = \frac{1}{a} \log\left(e^{ax} + e^{-ax} + b\right)$, since the constants $2b$ and $-b$ combine inside the logarithm.

Remark 1. For $L_M(\cdot)$, its first derivative $L_M'(\cdot)$ and its second derivative $L_M''(\cdot)$, we have the following:

L_M(x) = \frac{1}{a} \log\left(e^{ax} + e^{-ax} + b\right),
L_M'(x) = \frac{e^{ax} - e^{-ax}}{e^{ax} + e^{-ax} + b},
L_M''(x) = \frac{4a + ab\left(e^{ax} + e^{-ax}\right)}{\left(e^{ax} + e^{-ax} + b\right)^2}.
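Remark 1 translates directly into code. The following minimal sketch implements $L_M$ and its first two derivatives (assuming NumPy; $a = 1$ and $b = 1$ are arbitrary illustrative choices, with $a > 0$ being the only requirement stated above), together with a finite-difference check of the first derivative.

```python
import numpy as np

def L_M(x, a=1.0, b=1.0):
    # L_M(x) = (1/a) * log(exp(a*x) + exp(-a*x) + b)
    return np.log(np.exp(a * x) + np.exp(-a * x) + b) / a

def L_M_prime(x, a=1.0, b=1.0):
    # L_M'(x) = (exp(a*x) - exp(-a*x)) / (exp(a*x) + exp(-a*x) + b)
    e_p, e_m = np.exp(a * x), np.exp(-a * x)
    return (e_p - e_m) / (e_p + e_m + b)

def L_M_second(x, a=1.0, b=1.0):
    # L_M''(x) = (4a + a*b*(exp(a*x) + exp(-a*x))) / (exp(a*x) + exp(-a*x) + b)^2
    e_p, e_m = np.exp(a * x), np.exp(-a * x)
    return (4 * a + a * b * (e_p + e_m)) / (e_p + e_m + b) ** 2

# Central-difference check of the first derivative at a few points.
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
h = 1e-6
finite_diff = (L_M(x + h) - L_M(x - h)) / (2 * h)
print(np.max(np.abs(finite_diff - L_M_prime(x))))  # expected to be tiny (~1e-9 or less)
```

Note that the direct exponentials can overflow for large $|ax|$; a log-sum-exp style evaluation would be the numerically safer route in practice.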