arXiv:2108.12627v1 [stat.ML] 28 Aug 2021

Generalized Huber Loss for Robust Learning and its Efficient Minimization for a Robust Statistics

Kaan Gokcesu, Hakan Gokcesu

Abstract—We propose a generalized formulation of the Huber loss. We show that with a suitable function of choice, specifically the log-exp transform, we can achieve a loss function which combines the desirable properties of both the quadratic and the absolute loss. We provide an algorithm to find the minimizer of such loss functions and show that finding a centralizing metric is not that much harder than the traditional mean and median.

I. INTRODUCTION

Many problems in the learning, optimization and statistics literature [1]–[4] require robustness, i.e., that a trained (or optimized) model be less influenced by some outliers than by the nominal data (i.e., inlier data) [5], [6]. This approach is extremely common in parameter estimation and learning tasks, where a robust loss (e.g., the absolute error) may be more desirable than a nonrobust loss (e.g., the quadratic error) due to its insensitivity to large errors. Many penalties for robustness have been proposed in the literature [7], [8], including parametric formulations to achieve robustness [9]. In traditional learning approaches, like gradient descent and M-estimation [10], various losses are commonly used when designing a system. Normally, the performance metric is provided by the problem formulation itself. However, a suitably designed loss function can be useful when the resulting performance evaluation of the learned model is hard to express mathematically. It can be useful in problems such as model estimation [11], [12], parameter prediction [13], [14] and detection [15]–[17] strategies, where it has become important to design loss functions that are intrinsically resistant to outliers. In learning problems, two very commonly used loss functions are the squared (quadratic) loss,

L(x) = x^2,

and the absolute loss,

L(x) = |x|.

The underlying reasons are the following. The squared loss is strongly convex (hence, has a fast learning rate) and the absolute loss is robust. The squared loss has the disadvantage that when the underlying distribution of the nominal data is heavy-tailed, it can be dominated by the outliers, i.e., its efficiency is poor, and its minimizer (i.e., the mean) does not have sufficient distributional robustness [6]. Thus, the estimates may be heavily distorted compared to when the extreme outliers are not present. However, the absolute loss does not have these problems, since it is robust against arbitrary outliers: their contribution to the estimation is effectively determined by their ordinality and not by their values. Nonetheless, the absolute loss is not strongly convex, since it lacks the quadratic behavior, and hence it does not have the fast convergence of learning. Therefore, in general, when we have the freedom, instead of using one of these losses exclusively, it is of utmost importance to combine the best of both worlds and create algorithms which are both robust against the outliers and have fast convergence near the negligible errors.

To create a robust loss with fast convergence, we need to combine the properties of the quadratic and the absolute loss. The most straightforward approach is to use a piecewise function and combine the quadratic and the absolute losses where they work the best. As an example, we can straightforwardly use the following function:

L_D(x) = \begin{cases} x^2, & |x| \le 1 \\ |x|, & |x| > 1 \end{cases}    (1)

While this function is continuous, the strict cutoff at 1 may prove to be arbitrary or even useless for certain datasets. We can bypass this problem by using a free variable δ as

L_{D,\delta}(x) = \begin{cases} \frac{x^2}{\delta}, & |x| \le \delta \\ |x|, & |x| > \delta \end{cases}    (2)

While this version is somewhat useful, it may prove to be not differentiable, which makes the combination difficult to use in learning tasks. To solve the differentiability issue, one can modify the combination as in the following, which gives the most popular loss function that combines the quadratic and the absolute loss (i.e., the Huber loss) [3]:

L_{H,\delta}(x) = \begin{cases} \frac{x^2}{2\delta}, & |x| \le \delta \\ |x| - \frac{\delta}{2}, & |x| > \delta \end{cases}    (3)

Although this function is differentiable, it is not twice differentiable and thus it is not smooth. For smoothness, many variants have been proposed. A popular one is the Pseudo-Huber loss [18]:

L_{P,\delta}(x) = \delta\sqrt{\frac{x^2}{\delta^2} + 1},    (4)

which is quadratic near 0 and behaves like the absolute loss at the asymptotes.
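As a quick illustration of the losses in (1)–(4), the following minimal sketch (not part of the original text; it assumes NumPy, and the choice δ = 1 is arbitrary) evaluates them side by side:

```python
import numpy as np

def piecewise_loss(x):
    """Eq. (1): quadratic for |x| <= 1, absolute otherwise (continuous, fixed cutoff)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= 1.0, x**2, np.abs(x))

def huber_loss(x, delta=1.0):
    """Eq. (3): quadratic for |x| <= delta, linear outside; differentiable but not twice."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= delta, x**2 / (2.0 * delta), np.abs(x) - delta / 2.0)

def pseudo_huber_loss(x, delta=1.0):
    """Eq. (4): smooth everywhere, ~quadratic near 0 and ~|x| at the asymptotes."""
    x = np.asarray(x, dtype=float)
    return delta * np.sqrt(x**2 / delta**2 + 1.0)

x = np.linspace(-3.0, 3.0, 7)
print(piecewise_loss(x))
print(huber_loss(x, delta=1.0))
print(pseudo_huber_loss(x, delta=1.0))
```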

While the above is the most common form, other smooth approximations of the Huber loss function also exist [19]. All in all, the convention is to use either the Huber loss or some variant of it. To this end, we propose a formulation that encapsulates many different (possible) variants, which we call the Generalized-Huber loss, together with an algorithmic solver. The organization of our paper is the following. In Section II, we provide the generalized formulation of the smooth and robust loss. In Section III, we produce a strictly convex smooth robust loss from the generalized formulation. In Section IV, we design an algorithm which minimizes such strictly convex loss functions. In Section V, we finish with further discussions and concluding remarks.

II. THE GENERALIZED HUBER LOSS

In this section, we introduce the Generalized-Huber loss. We first start with the definition of a general loss function.

Definition 1. Let L(·) be some loss function such that

L : \Re \to \Re,

where the minimum is at x = 0, i.e.,

\arg\min_x L(x) = 0,
\min_x L(x) = L(0).

For a general loss function L(·), we have the following property.

Lemma 1. If L(·) and its first derivative L′(·) are continuous at x = 0 with positive second derivative (i.e., L′′(0) > 0) and finite higher derivatives, L(·) has a quadratic behavior near x = 0.

Proof. Near x = 0, we have

L(x) = L(0) + L'(0)x + \frac{1}{2}L''(0)x^2 + o(x^2),    (5)

from Taylor's expansion. Since L′(0) = 0 because of continuity and the minimum at 0, we have

L(x) \cong L(0) + \frac{1}{2}L''(0)x^2,    (6)

near 0. Hence, it suffices for the loss function to have a positive second derivative and finite higher derivatives at zero for its convergence to a quadratic function.

This result shows that if a loss function and its first derivative are continuous at x = 0, it has a quadratic behavior for small error. Unfortunately, the absolute loss function does not have a continuous derivative at x = 0. To solve this, we smooth it with an isotonic/monotonic auxiliary function f(·) [20].

Definition 2. Let f(·) be some monotone increasing function such that

\lim_{x\to\infty} f(x) = \infty,
\lim_{x\to-\infty} f(x) < \infty.

The auxiliary function f(·) diverges when x → ∞ and converges when x → −∞. Using this auxiliary function, we create the smoothed absolute loss as the following.

Definition 3. Let the smoothed loss be

L_G(x) = g(f(x) + f(-x)),

where g(·) is the inverse of f(·), i.e., f^{-1}(·).

We can see that for a smooth and differentiable auxiliary function f(·), this loss is also smooth and differentiable. Next, we study its behavior for small and large errors.

Lemma 2. L_G(·) converges to the absolute loss at the asymptotes, i.e.,

\lim_{|x|\to\infty} L_G(x) = |x|.

Proof. When x goes to ∞, f(−x) does not diverge, hence,

L_G(x) \to g(f(x)) = x \text{ as } x \to \infty,    (7)
L_G(x) \to g(f(-x)) = -x \text{ as } x \to -\infty,    (8)

which concludes the proof.

Lemma 3. L_G(x) converges to the quadratic loss near 0, i.e.,

L_G(x) \to a x^2 + b \text{ as } |x| \to 0,

where a = g'(2f(0))f''(0) and b = g(2f(0)).

Proof. We have

L_G(x) = g(f_+(x)),    (9)
L_G'(x) = g'(f_+(x)) f_+'(x),    (10)
L_G''(x) = g''(f_+(x))[f_+'(x)]^2 + g'(f_+(x)) f_+''(x),    (11)

where

f_+(x) \triangleq f(x) + f(-x),    (12)
f_+'(x) = f'(x) - f'(-x),    (13)
f_+''(x) = f''(x) + f''(-x).    (14)

Thus, at x = 0, we have

L_G(0) = g(2f(0)),    (15)
L_G'(0) = 0,    (16)
L_G''(0) = 2 g'(2f(0)) f''(0),    (17)

since f_+(0) = 2f(0), f_+'(0) = 0 and f_+''(0) = 2f''(0). Hence,

L_G(x) \to g'(2f(0)) f''(0) x^2 + g(2f(0)) \text{ as } |x| \to 0,    (18)

from (6), which concludes the proof.

Corollary 1. From Lemma 3, we have for Definition 2 and Definition 3 the following:

L_G''(0) > 0 \iff f''(0) > 0.

Proof. Since f(·) is monotonically increasing, so is g(·) (i.e., f^{-1}(·)). Thus,

f'(x), g'(x) > 0, \forall x,

i.e., both derivatives are strictly greater than 0, which concludes the proof.

Example 1. When we use a quadratic auxiliary function as

f(x) = U(x)\frac{x^2}{\delta^2} + 1,

where U(x) is the step function, note that this f(·) is not increasing everywhere and does not have a direct inverse. However, we can use the following pseudo-inverse

g(x) = \delta\sqrt{x - 1}, \quad x \ge 1.

Thus, the loss function becomes

L(x) = g(f(x) + f(-x)) = \delta\sqrt{\frac{x^2}{\delta^2} + 1},

which is the Pseudo-Huber loss.

This generalized formulation does not guarantee convexity over the whole domain. For convexity to exist, specific functions need to be studied. In the next section, we will study one such function.
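The construction of Definition 3 is easy to prototype. The sketch below (an illustration, not part of the original text; it assumes NumPy, and δ = 1.5 is an arbitrary choice) builds L_G from the auxiliary function of Example 1 and checks numerically that it reproduces the Pseudo-Huber loss and the asymptotic behavior of Lemma 2:

```python
import numpy as np

delta = 1.5

def f(x):
    """Auxiliary function of Example 1: U(x) * x^2 / delta^2 + 1."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0.0, x**2 / delta**2, 0.0) + 1.0

def g(y):
    """Pseudo-inverse of f, valid for y >= 1."""
    return delta * np.sqrt(np.asarray(y, dtype=float) - 1.0)

def L_G(x):
    """Generalized-Huber loss of Definition 3: g(f(x) + f(-x))."""
    x = np.asarray(x, dtype=float)
    return g(f(x) + f(-x))

x = np.linspace(-100.0, 100.0, 201)
# Recovers the Pseudo-Huber loss delta * sqrt(x^2/delta^2 + 1):
print(np.allclose(L_G(x), delta * np.sqrt(x**2 / delta**2 + 1.0)))   # True
# Lemma 2: near the asymptotes L_G(x) approaches |x|.
print(L_G(np.array([1e4, -1e4])))   # both approximately 1e4
```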

III. A STRICTLY CONVEX SMOOTH ROBUST LOSS

For convexity, we study the exponential transform. Let

f(x) = e^{ax} + b,    (19)

for some a > 0, hence

g(x) = f^{-1}(x) = \frac{1}{a}\log(x - b), \quad x > b,    (20)

and the loss function is

L_M(x) = \frac{1}{a}\log(e^{ax} + e^{-ax} + b),    (21)

for b + 2 > 0. The log-exp transform has a beautiful convexity property.

Lemma 4. Let l_i(x) for i ∈ {1, ..., I} be I convex functions. We have the following convex function L(·):

L(x) \triangleq \log\left(\sum_{i=1}^{I} e^{l_i(x)}\right).

Proof. Let x = \lambda x_1 + (1 - \lambda) x_2 for some 0 \le \lambda \le 1. We have

\sum_{i=1}^{I} e^{l_i(x)} \le \sum_{i=1}^{I} e^{\lambda l_i(x_1) + (1-\lambda) l_i(x_2)},    (22)

from the convexity of l_i(·). Setting

a_i \triangleq e^{\lambda l_i(x_1)},    (23)
b_i \triangleq e^{(1-\lambda) l_i(x_2)},    (24)

we have

\sum_{i=1}^{I} e^{l_i(x)} \le \sum_{i=1}^{I} a_i b_i,    (25)
\le \left(\sum_{i=1}^{I} a_i^{\frac{1}{\lambda}}\right)^{\lambda} \left(\sum_{i=1}^{I} b_i^{\frac{1}{1-\lambda}}\right)^{1-\lambda},    (26)

from Holder's inequality [21]. Thus,

L(x) = L(\lambda x_1 + (1-\lambda) x_2)    (27)
= \log\left(\sum_{i=1}^{I} e^{l_i(\lambda x_1 + (1-\lambda) x_2)}\right)    (28)
\le \lambda \log\left(\sum_{i=1}^{I} a_i^{\frac{1}{\lambda}}\right) + (1-\lambda) \log\left(\sum_{i=1}^{I} b_i^{\frac{1}{1-\lambda}}\right),    (29)
\le \lambda \log\left(\sum_{i=1}^{I} e^{l_i(x_1)}\right) + (1-\lambda) \log\left(\sum_{i=1}^{I} e^{l_i(x_2)}\right),    (30)
\le \lambda L(x_1) + (1-\lambda) L(x_2),    (31)

which concludes the proof.

This result is intuitive from the smooth maximum [22].
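To make Lemma 4 and the definition in (21) concrete, the short sketch below (illustrative only, assuming NumPy; the values of a and b are arbitrary with b > 0) views a·L_M as a log-sum-exp of the convex (affine) functions a x, −a x and the constant log b, and checks the convex-combination inequality numerically:

```python
import numpy as np

a, b = 1.0, 3.0   # any a > 0; b >= 0 keeps the loss convex (see Remark 2 below)

def L_M(x):
    """Log-exp loss of (21): (1/a) * log(exp(a x) + exp(-a x) + b)."""
    x = np.asarray(x, dtype=float)
    return np.log(np.exp(a * x) + np.exp(-a * x) + b) / a

# For b > 0, a * L_M(x) = log(e^{a x} + e^{-a x} + e^{log b}) is a log-sum-exp
# of the convex functions a x, -a x and log(b), so Lemma 4 applies.
rng = np.random.default_rng(0)
x1, x2 = rng.uniform(-10.0, 10.0, size=(2, 1000))
lam = rng.uniform(0.0, 1.0, size=1000)
lhs = L_M(lam * x1 + (1.0 - lam) * x2)
rhs = lam * L_M(x1) + (1.0 - lam) * L_M(x2)
print(np.all(lhs <= rhs + 1e-12))   # True: the inequality of (31) holds on the samples
```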

Remark 1. For L_M(·), its first derivative L_M'(·) and its second derivative L_M''(·), we have the following:

L_M(x) = \frac{1}{a}\log(e^{ax} + e^{-ax} + b),
L_M'(x) = \frac{e^{ax} - e^{-ax}}{e^{ax} + e^{-ax} + b},
L_M''(x) = \frac{4a + ab(e^{ax} + e^{-ax})}{(e^{ax} + e^{-ax} + b)^2}.

Remark 2. From the expression for L_M''(·) in Remark 1, we see that for convexity on the whole domain, we need b \ge 0 (since a > 0), which is in line with Lemma 4.

Corollary 2. L_M(·) converges to the absolute loss at the asymptotes from Lemma 2, i.e.,

L_M(x) \to |x| \text{ as } |x| \to \infty.

Corollary 3. L_M(·) converges to the quadratic loss near 0 from Lemma 3, specifically

L_M(x) \to \frac{1}{a}\log(2 + b) + \frac{a}{2 + b}x^2 \text{ as } |x| \to 0.

Proof. At x = 0, we have

L_M(0) = \frac{1}{a}\log(2 + b),    (32)
L_M'(0) = 0,    (33)
L_M''(0) = \frac{2a}{b + 2},    (34)

for b + 2 > 0. The result comes from Taylor's expansion near zero.

Example 2. For any finite b, when a goes to infinity, we have the absolute loss, i.e., the L1 loss.

Example 3. If b = −1, a = 1, we have direct convergence to x^2 near x = 0. However, while this most straightforwardly combines |x| and x^2, it is not convex.

Example 4. If b = 0 and a = 1, we have the log-cosh loss [23] translated by log(2).

Remark 3. When b \ge 2, the loss function L_M(·) has the following alternative expression:

L_M(x) = \frac{1}{a}\log\left(c e^{ax} + \frac{1}{c}\right) + \frac{1}{a}\log\left(c e^{-ax} + \frac{1}{c}\right),

where c > 0 is such that

c^2 + \frac{1}{c^2} = b.

This formulation is advantageous in that it separates the loss function between its two asymptotes, which can be straightforwardly used to design different losses with varying asymptotes.

In the next section, we provide an algorithm to find the minimizer of our loss function.
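The closed forms in Remark 1 and the special cases in Examples 2 and 4 can be sanity-checked numerically. The following sketch (illustrative only, assuming NumPy; grids, tolerances and parameter values are arbitrary) compares the stated derivatives against finite differences:

```python
import numpy as np

def L_M(x, a, b):
    return np.log(np.exp(a * x) + np.exp(-a * x) + b) / a

def L_M_prime(x, a, b):
    """First derivative from Remark 1."""
    e, em = np.exp(a * x), np.exp(-a * x)
    return (e - em) / (e + em + b)

def L_M_second(x, a, b):
    """Second derivative from Remark 1; nonnegative for all x iff b >= 0 (Remark 2)."""
    e, em = np.exp(a * x), np.exp(-a * x)
    return (4.0 * a + a * b * (e + em)) / (e + em + b) ** 2

x = np.linspace(-3.0, 3.0, 601)
a, b = 2.0, 1.0
num_prime = np.gradient(L_M(x, a, b), x)
print(np.max(np.abs(num_prime[5:-5] - L_M_prime(x, a, b)[5:-5])) < 1e-3)    # True

num_second = np.gradient(num_prime, x)
print(np.max(np.abs(num_second[5:-5] - L_M_second(x, a, b)[5:-5])) < 1e-2)  # True

# Example 4: b = 0, a = 1 gives the log-cosh loss shifted by log(2).
print(np.allclose(L_M(x, 1.0, 0.0), np.log(np.cosh(x)) + np.log(2.0)))      # True
# Example 2: large a approaches the absolute loss.
print(np.max(np.abs(L_M(x, 50.0, 1.0) - np.abs(x))) < 0.1)                  # True
```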

Algorithm 1 Finding the Centralizing Sample Pairs

Initialize I_0 = 1 and I_1 = N.
STEP:
if I_1 = I_0 + 1 then
    x_L^* = x_{I_0}, x_H^* = x_{I_1}
    Return x_L^* and x_H^*
else if I_1 ≠ I_0 + 1 then
    I = [(I_0 + I_1)/2], where [·] rounds to the nearest integer
    Calculate G = L'(x_I)
end if
if G = 0 then
    Return the minimizer x^* = x_I
else if G > 0 then
    Go to STEP with the update I_1 = I
else if G < 0 then
    Go to STEP with the update I_0 = I
end if

Algorithm 2 Finding an ε-optimal Solution

Initialize x_0 = x_L, x_1 = x_H and ε.
STEP:
if x_1 − x_0 ≤ 2ε then
    x_ε^* = (x_0 + x_1)/2
    Return the ε-optimal point x_ε^*
else if x_1 − x_0 > 2ε then
    x̂ = (x_0 + x_1)/2
    Calculate G = L'(x̂)
end if
if G = 0 then
    Return the minimizer x^* = x̂
else if G > 0 then
    Go to STEP with the update x_1 = x̂
else if G < 0 then
    Go to STEP with the update x_0 = x̂
end if
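A minimal Python rendering of Algorithms 1 and 2 (a sketch, not the authors' reference implementation) could look as follows; it assumes at least two samples and a callable `grad` returning the derivative L'(·) of the cumulative loss:

```python
import numpy as np

def centralizing_pair(samples, grad):
    """Algorithm 1 (sketch): bisection over the sorted sample indices using the
    sign of the gradient; returns (x, x) if a sample is the exact minimizer,
    otherwise the adjacent pair (x_L*, x_H*) that brackets the minimizer."""
    xs = np.sort(np.asarray(samples, dtype=float))
    i0, i1 = 0, len(xs) - 1
    while i1 != i0 + 1:
        i = int(round((i0 + i1) / 2))
        g = grad(xs[i])
        if g == 0:
            return xs[i], xs[i]
        elif g > 0:
            i1 = i   # minimizer lies to the left of x_i
        else:
            i0 = i   # minimizer lies to the right of x_i
    return xs[i0], xs[i1]

def eps_optimal(x_lo, x_hi, grad, eps):
    """Algorithm 2 (sketch): interval bisection down to an eps-optimal point."""
    while x_hi - x_lo > 2 * eps:
        x_mid = (x_lo + x_hi) / 2
        g = grad(x_mid)
        if g == 0:
            return x_mid
        elif g > 0:
            x_hi = x_mid
        else:
            x_lo = x_mid
    return (x_lo + x_hi) / 2
```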

IV. THE MINIMIZER OF STRICTLY CONVEX LOSSES

Let us have the samples \{x_n\}_{n=1}^N, where we want to minimize the cumulative loss for some function L_0(·), i.e.,

\min_x L(x) \triangleq \min_x \sum_{n=1}^{N} L_0(x - x_n).    (35)

Remark 4. For the absolute loss, a minimizer is the median:

\arg\min_x \sum_{n=1}^{N} |x - x_n| = \mathrm{median}(\{x_n\}_{n=1}^N),

i.e., when \{x_n\}_{n=1}^N are ordered, we have

\mathrm{median}(\{x_n\}_{n=1}^N) = \begin{cases} x_{\frac{N+1}{2}}, & N \text{ is odd} \\ \frac{1}{2}\left(x_{\frac{N}{2}} + x_{\frac{N}{2}+1}\right), & N \text{ is even} \end{cases}

which has O(N log N) complexity if \{x_n\}_{n=1}^N is unordered [24].

Remark 5. For the quadratic loss, the minimizer is the mean:

\arg\min_x \sum_{n=1}^{N} (x - x_n)^2 = \frac{1}{N}\sum_{n=1}^{N} x_n,

which has O(N) complexity when \{x_n\}_{n=1}^N is unordered.

Although both the absolute and the quadratic losses have closed form minimizers, this may not be possible for general loss functions. However, it is possible to find a close minimizer efficiently if the loss L_0(·) is strictly convex as in Section III.

Remark 6. When L_0(x) is strictly convex, we have the following properties:
• L(x) is also strictly convex, hence, it has a unique minimizer x^*.
• When the gradient is zero, i.e., L'(x) = 0 for some x, it is the minimizer, i.e., x = x^*.
• When the gradient is positive, i.e., L'(x) > 0 for some x, it is greater than the minimizer, i.e., x > x^*.
• When the gradient is negative, i.e., L'(x) < 0 for some x, it is less than the minimizer, i.e., x < x^*.

The properties in this remark are at the core of the algorithm, which gets ever closer to the minimizer with each step. A summary of it is given in Algorithm 1.

Remark 7. The algorithm terminates and returns either the minimizer x^*; or, if it was not able to find the minimizer x^*, it returns two adjacent sample points x_L^* and x_H^*, where the minimizer is such that

x^* \in (x_L^*, x_H^*).

Remark 8. The runtime of the algorithm is O(N log N), since the gradient is calculated O(log N) times and each calculation takes O(N) time. This linearithmic complexity is efficient, since if the samples \{x_n\}_{n=1}^N were unordered, ordering them has that much complexity.

For many applications, finding the centralizing adjacent pair \{x_L^*, x_H^*\} may be sufficient, as in the case of the absolute loss for an even number of samples. In the absolute loss, the median is an arbitrary minimizer, which is the mean of the centralizing pair by definition. However, each point between that pair is also a minimizer.

If the centralizing pair is not enough and we want to find the minimizer up to a chosen closeness ε, we can run a similar algorithm, which is given in Algorithm 2.

Remark 9. If the algorithm was not able to find the minimizer x^*, it returns an ε-optimal point x_ε^*, which satisfies

|x^* - x_\epsilon^*| \le \epsilon.

Remark 10. The runtime of the algorithm is O(N \log(D/\epsilon)), where D is the separation between x_L and x_H, i.e.,

D \triangleq x_H - x_L.

We have this complexity since the gradient is calculated O\left(\log\frac{x_H - x_L}{\epsilon}\right) times and each calculation takes O(N) time. (x_L, x_H) can either be (x_L^*, x_H^*) from Algorithm 1, or (\min_i x_i, \max_i x_i), which can be found in O(N) time.
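As a usage illustration (not taken from the paper), the sketch below minimizes the cumulative log-exp loss of (35) over a contaminated sample set using the `centralizing_pair` and `eps_optimal` sketches given after Algorithms 1 and 2; the sample sizes, contamination level and the parameters a, b and eps are arbitrary choices:

```python
import numpy as np

a, b = 1.0, 1.0   # b >= 0, so the cumulative loss in (35) is strictly convex

def grad_L0(u):
    """Derivative of the log-exp loss L_M from Remark 1."""
    e, em = np.exp(a * u), np.exp(-a * u)
    return (e - em) / (e + em + b)

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(50.0, 1.0, 5)])  # 5% outliers

def grad(x):
    """Gradient of the cumulative loss in (35)."""
    return np.sum(grad_L0(x - samples))

x_lo, x_hi = centralizing_pair(samples, grad)      # Algorithm 1 sketch
x_star = eps_optimal(x_lo, x_hi, grad, eps=1e-6)   # Algorithm 2 sketch

print(np.mean(samples), np.median(samples), x_star)
# The mean is pulled toward the outliers, while the robust minimizer stays close to the median.
```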

V. DISCUSSIONS AND CONCLUSION

In this work, we have studied how to combine the nice properties of the absolute loss (robustness) and the quadratic loss (strong convexity). Our loss definition is straightforward to use in the multivariate case by analyzing each dimension individually. Note that in both the absolute and the quadratic loss settings, the multidimensional optimization problem reduces to the optimization in each dimension separately.

In the literature, there are some nice properties to have for centralizing metrics, like equivariance under scaling, translation, rotation or some other transform [25]. We point out that every loss function (including ours) that depends on the difference between the parameter x and the samples x_n will be equivariant under translation. However, our loss functions are not equivariant under scaling, in contrast to the absolute and quadratic losses. In general, equivariance under scaling may not even be desirable, especially if x_n are outputs of a non-linear transform. Furthermore, our loss is not equivariant under rotations, unlike the quadratic loss. The reason is that our loss does not depend on the Euclidean distance in the multivariate case (which is needed for rotational equivariance). Note that the median is also not equivariant under rotations. Although there are equivariant extensions like the geometric median [26], such an enforced intra-dimensional relation may not always be meaningful. Moreover, our loss is not equivariant under arbitrary monotonic transforms, unlike the median. Note that the mean is also not equivariant under monotonic transforms. Such a strong property comes with its disadvantages, where the sample values x_n become almost inconsequential and only their ordinality matters. However, this disregard for the values is also what makes the median a robust statistic with the highest possible breakdown point (i.e., the most resistant statistic). Nonetheless, it is straightforward to make the estimation equivariant. For example, if the input data is whitened at the preprocessing stage, the estimations will be equivariant under rotations. When it is variance normalized (or normalized by some other distance measure), it will be equivariant under scalings. Similarly, mean normalization will make it equivariant under translations.

Although the absolute loss is robust, its efficiency decreases substantially when the number of outliers is comparable with the number of nominal data. However, the absolute loss is the limit of convexity. While concave asymptotes can be considered to achieve further robustness, they will eliminate the convexity property and will require global optimization techniques [27]. Such a loss metric will again have a quadratic behavior near x = 0, will be near linear (i.e., the absolute loss) in some intermediate region, and will be the concave function of choice at the asymptotes. Such loss designs can also be found in the literature, like the log-linear loss [28].

In conclusion, our work proposes a generalized formulation of the Huber loss. We show that with a suitable function of choice, specifically the log-exp transform, we can achieve a loss function which combines the desirable properties of both the absolute and the quadratic loss. We provide an algorithm to find the minimizer of such loss functions and show that finding a centralizing metric is not that much harder than the traditional mean and median.

REFERENCES

[1] H. V. Poor, An Introduction to Signal Detection and Estimation. NJ: Springer, 1994.
[2] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
[3] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, ser. Springer Series in Statistics. New York, NY, USA: Springer New York Inc., 2001.
[4] S. Portnoy and X. He, "A robust journey in the new millennium," Journal of the American Statistical Association, vol. 95, no. 452, pp. 1331–1335, 2000.
[5] T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2019.
[6] P. J. Huber, Robust Statistics. John Wiley & Sons, 2004, vol. 523.
[7] M. J. Black and A. Rangarajan, "On the unification of line processes, outlier rejection, and robust statistics with applications in early vision," International Journal of Computer Vision, vol. 19, no. 1, pp. 57–91, 1996.
[8] Z. Zhang, "Parameter estimation techniques: A tutorial with application to conic fitting," Image and Vision Computing, vol. 15, no. 1, pp. 59–76, 1997.
[9] J. T. Barron, "A general and adaptive robust loss function," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4331–4339.
[10] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, 2011, vol. 196.
[11] K. Gokcesu and S. S. Kozat, "Online density estimation of nonstationary sources using exponential family of distributions," IEEE Trans. Neural Networks Learn. Syst., vol. 29, no. 9, pp. 4473–4478, 2018.
[12] J. V. Beck and K. J. Arnold, Parameter Estimation in Engineering and Science. James Beck, 1977.
[13] N. D. Vanli, K. Gokcesu, M. O. Sayin, H. Yildiz, and S. S. Kozat, "Sequential prediction over hierarchical structures," IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6284–6298, Dec 2016.
[14] A. C. Singer and M. Feder, "Universal linear prediction by model order weighting," IEEE Transactions on Signal Processing, vol. 47, no. 10, pp. 2685–2699, Oct 1999.
[15] K. Gokcesu and S. S. Kozat, "Online anomaly detection with minimax optimal density estimation in nonstationary environments," IEEE Trans. Signal Process., vol. 66, no. 5, pp. 1213–1227, 2018.
[16] I. Delibalta, K. Gokcesu, M. Simsek, L. Baruh, and S. S. Kozat, "Online anomaly detection with nested trees," IEEE Signal Process. Lett., vol. 23, no. 12, pp. 1867–1871, 2016.
[17] K. Gokcesu, M. M. Neyshabouri, H. Gokcesu, and S. S. Kozat, "Sequential outlier detection based on incremental decision trees," IEEE Trans. Signal Process., vol. 67, no. 4, pp. 993–1005, 2019.
[18] P. Charbonnier, L. Blanc-Féraud, G. Aubert, and M. Barlaud, "Deterministic edge-preserving regularization in computed imaging," IEEE Transactions on Image Processing, vol. 6, no. 2, pp. 298–311, 1997.
[19] K. Lange, "Convergence of EM image reconstruction algorithms with Gibbs smoothing," IEEE Transactions on Medical Imaging, vol. 9, no. 4, pp. 439–446, 1990.
[20] K. Gokcesu and H. Gokcesu, "Optimally efficient sequential calibration of binary classifiers to minimize classification error," arXiv preprint arXiv:2108.08780, 2021.
[21] G. H. Hardy, J. E. Littlewood, and G. Pólya, Inequalities. Cambridge University Press, 1952.
[22] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, "Dive into deep learning," arXiv preprint arXiv:2106.11342, 2021.
[23] R. Neuneier and H. G. Zimmermann, "How to train neural networks," in Neural Networks: Tricks of the Trade. Springer, 1998, pp. 373–423.
[24] R. Sedgewick and K. Wayne, Algorithms. Addison-Wesley Professional, 2011.
[25] W. S. Sarle, "Measurement theory: Frequently asked questions," Disseminations of the International Statistical Applications Institute, vol. 1, no. 4, pp. 61–66, 1995.
[26] Z. Drezner, K. Klamroth, A. Schöbel, and G. O. Wesolowsky, "The Weber problem," Facility Location: Applications and Theory, pp. 1–36, 2002.
[27] K. Gokcesu and H. Gokcesu, "Regret analysis of global optimization in univariate functions with Lipschitz derivatives," arXiv preprint arXiv:2108.10859, 2021.
[28] D. Kim, C. Lee, S. Hwang, and M. K. Jeong, "A robust support vector regression with a linear-log concave loss function," Journal of the Operational Research Society, vol. 67, no. 5, pp. 735–742, 2016.