Active Regression with Adaptive Huber Loss

Jacopo Cavazza and Vittorio Murino, Senior Member, IEEE
Abstract—This paper addresses the scalar regression problem through a novel solution to exactly optimize the Huber loss in a general semi-supervised setting, which combines multi-view learning and manifold regularization. We propose a principled algorithm to 1) avoid computationally expensive iterative schemes while 2) adapting the Huber loss threshold in a data-driven fashion and 3) actively balancing the use of labelled data to remove noisy or inconsistent annotations at the training stage. In a wide experimental evaluation, dealing with diverse applications, we assess the superiority of our paradigm, which is able to combine robustness towards noise with both strong performance and low computational cost.

Index Terms—Robust Scalar Regression, Learning with Noisy Labels, Huber Loss with Adaptive Threshold, Convex Optimization, Crowd Counting.

Jacopo Cavazza and Vittorio Murino are with Pattern Analysis & Computer Vision, Istituto Italiano di Tecnologia (IIT), Via Morego 30, 16163, Genova, Italy. Jacopo Cavazza is also with the Dipartimento di Ingegneria Navale, Elettrica, Elettronica e delle Telecomunicazioni (DITEN), University of Genova, Via Opera Pia 11, 16145, Genova, Italy. Vittorio Murino is also with the Computer Science Department, University of Verona, Strada Le Grazie 15, 37134, Verona, Italy. Primary email contact: [email protected].

I. INTRODUCTION

Regression is one of the most widely studied problems in research disciplines such as statistics, econometrics, computational biology and physics, to name a few, and is a pillar topic in machine learning. Formally, it addresses the problem of inferring a functional input-output relationship. Under a classical machine learning perspective, the latter is learnt by means of (training) examples and, to this aim, two mainstream approaches emerge: optimization-based and Bayesian frameworks. In the former, once a suitable hypothesis space is fixed, the goal is to minimize an objective functional, in which a loss measures how well the learned regressor reproduces the relationship between input and output variables inside the training set. In the Bayesian formalism, instead, a prior distribution constrains the solution upon some a priori knowledge, while the final regression map maximizes the posterior/likelihood probability distribution.

A problem affecting both paradigms lies in the scarcity of annotations in the training data, which can seriously impact the generalizability of the regression method. Indeed, labelled data are time-consuming, frequently onerous and sometimes difficult to obtain. Thus, semi-supervised approaches play a substantial role in exploiting unlabelled samples to support the search for the solution. Among the most effective semi-supervised algorithms, multi-view learning [1], [2] considers the structure of the input data as composed of several "views", which are associated to different hypothesis spaces employed to construct as many decision functions, finally fused together. Manifold regularization is another state-of-the-art semi-supervised method, in which the geometry of the data space shapes the solution, typically by means of the graph Laplacian operator [3], [4].

However, even in the small-labelling regime, since annotations are usually provided by human operators, they are frequently prone to errors and noisy in general, making them rather misleading. Hence, from both a theoretical and a practical standpoint, it is of utmost importance to devise algorithms which automatically analyse the data so as to guarantee robustness towards outliers. In the literature, several works have tackled this problem [5]: since the archetypal work [6], many robust regression and classification frameworks [7]–[10] have successfully leveraged the (convex and differentiable) Huber loss function

$$H_\xi : \mathbb{R} \to [0,+\infty), \qquad H_\xi(y) = \begin{cases} \dfrac{y^2}{2} & \text{if } |y| \le \xi, \\[4pt] \xi |y| - \dfrac{\xi^2}{2} & \text{otherwise,} \end{cases} \tag{1}$$

where $\xi > 0$. However, as the major drawback of $H_\xi$, there is no closed-form solution to optimize it and, as a consequence, iterative schemes (such as quadratic programming [8] or self-dual minimization [9]) were previously exploited for either the original Huber loss [7], [9] or its spurious versions (the hinge-Huber loss [8] or the huberized Laplacian [10]). Moreover, in all cases, additional computational effort has to be spent to fix the threshold $\xi$, e.g., via statistical efficiency analysis [7].
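For concreteness, the following minimal sketch evaluates the loss in (1). It is an illustration only, not part of the HLR formulation: the NumPy implementation, the function name huber_loss and the test residuals are our own choices. It simply exposes the quadratic-to-linear transition at the threshold $\xi$.

```python
import numpy as np

def huber_loss(y, xi):
    """Huber loss H_xi(y) of Eq. (1): quadratic for |y| <= xi, linear beyond."""
    y = np.asarray(y, dtype=float)
    quadratic = 0.5 * y ** 2                  # branch used where |y| <= xi
    linear = xi * np.abs(y) - 0.5 * xi ** 2   # branch used where |y| > xi
    return np.where(np.abs(y) <= xi, quadratic, linear)

# Large residuals are penalized only linearly, which limits the influence of outliers.
residuals = np.array([-5.0, -1.0, 0.0, 0.5, 3.0])
print(huber_loss(residuals, xi=1.0))  # -> [4.5, 0.5, 0.0, 0.125, 2.5]
```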
In this work, we address all the aforementioned issues through the following main contributions.

I. We derive a novel theoretical solution to exactly optimize the Huber loss in a general multi-view, semi-supervised and manifold-regularized setting [11], in order to guarantee a broad applicability of the developed formalism.

II. We devise the novel Huber Loss Regression (HLR) algorithm to efficiently implement the proposed solution and avoid classical iterative schemes [8], [9]. Moreover, two additional aspects are notable.
Active learning. While taking advantage of both labelled and unlabelled training samples, the former are inspected so that HLR automatically removes those annotations which violate a specific numerical check, whenever they are recognized as either noisy or inconsistent for learning the regressor.
Adaptive threshold. Differently from [7]–[10], HLR automatically learns $\xi$ in a data-driven fashion without increasing the computational complexity of the whole pipeline.

III. Through an extensive empirical evaluation, we validate the proposed technique, which achieves competitive results in curve fitting, learning with noisy labels, classical regression problems and the crowd counting application. While using varied types of data and addressing diverse problems, HLR is able to outperform state-of-the-art regression algorithms.

The paper is outlined as follows. In Sections II and III, we present our exact optimization for the Huber loss after defining the general semi-supervised setting we consider. Section IV broadly discusses the HLR algorithm. To prove its versatility, after benchmarking it against a state-of-the-art convex solver in Section V-A, we register strong performance when applying HLR to curve fitting and learning with noisy labels (Section V-B), classical regression problems (Section V-C) and the crowd counting application (Section V-D). Finally, Section VI draws the conclusions.

II. MULTI-VIEW SCALAR REGRESSION

In this Section, we introduce the formalism to model data points sampled from a composite input space $\mathcal{X}$ which is divided into multiple substructures. Precisely, we assume that any $x \in \mathcal{X}$ can be written as $x = [x^1, \dots, x^m]$, where $x^\alpha$ belongs to the subspace $\mathcal{X}^\alpha$ for any $\alpha = 1, \dots, m$. This is a very natural way to model high-dimensional data: $x$ is the concatenation of $x^1, \dots, x^m$, each representing a particular class of features, that is, one out of multiple views [1], [2], [11] in which the data may be structured.

In order to find the regression map, we assume that it belongs to a hypothesis space $\mathcal{H}$ of functions $h \colon \mathcal{X} \to \mathbb{R}$ whose construction is investigated below. For any $\alpha = 1, \dots, m$, let $\kappa^\alpha \colon \mathcal{X}^\alpha \times \mathcal{X}^\alpha \to \mathbb{R}$ be a Mercer kernel [12], that is, a symmetric and positive semi-definite function. Let us define $K(x, z) = \mathrm{diag}(\kappa^1(x^1, z^1), \dots, \kappa^m(x^m, z^m)) \in \mathbb{R}^{m \times m}$, where $x = [x^1, \dots, x^m]$, $z = [z^1, \dots, z^m] \in \mathcal{X}$. Consider the space $S_0$ of functions $z \mapsto f(z) = \sum_{i=1}^{n} K(z, x_i) u_i$, with $x_1, \dots, x_n \in \mathcal{X}$ and $u_1, \dots, u_n$ column vectors in $\mathbb{R}^m$, and define the norm $\|f\|_K^2 = \sum_{i,j=1}^{n} u_i^\top K(x_i, x_j) u_j$. The reproducing kernel Hilbert space (RKHS) $S_K$ related to $K$ is the completion of $S_0$ with the limits of Cauchy sequences converging with respect to $\|\cdot\|_K$ [13]. In this construction, the contribution of each view to the regression map rewrites as $\sum_i c^\alpha \kappa^\alpha(z^\alpha, x_i^\alpha)\, u_i^\alpha$, for $\alpha = 1, \dots, m$; in contrast, an analogous result holds in [1] for $m = 2$ only.
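As a sanity check on the construction above, the sketch below instantiates the matrix-valued kernel $K(x, z)$, a function $f(z) = \sum_i K(z, x_i) u_i$ and the norm $\|f\|_K^2$ on a toy two-view problem. It is illustrative only: the Gaussian choice for each $\kappa^\alpha$, the variable names and the random data are our own assumptions, not the authors' implementation.

```python
import numpy as np

def view_kernel(x_a, z_a, gamma=1.0):
    # One possible Mercer kernel kappa^alpha on a single view: the Gaussian kernel.
    return np.exp(-gamma * np.sum((np.asarray(x_a) - np.asarray(z_a)) ** 2))

def K(x_views, z_views, gamma=1.0):
    # Matrix-valued kernel K(x, z) = diag(kappa^1(x^1, z^1), ..., kappa^m(x^m, z^m)).
    return np.diag([view_kernel(xa, za, gamma) for xa, za in zip(x_views, z_views)])

def f(z_views, anchors, U, gamma=1.0):
    # f(z) = sum_i K(z, x_i) u_i, with coefficient vectors u_i in R^m (rows of U).
    return sum(K(z_views, x_i, gamma) @ u_i for x_i, u_i in zip(anchors, U))

def rkhs_norm_sq(anchors, U, gamma=1.0):
    # ||f||_K^2 = sum_{i,j} u_i^T K(x_i, x_j) u_j.
    n = len(anchors)
    return float(sum(U[i] @ K(anchors[i], anchors[j], gamma) @ U[j]
                     for i in range(n) for j in range(n)))

# Toy data: m = 2 views, n = 3 anchor points, each view two-dimensional.
rng = np.random.default_rng(0)
anchors = [[rng.normal(size=2), rng.normal(size=2)] for _ in range(3)]  # x_i = [x_i^1, x_i^2]
U = rng.normal(size=(3, 2))                                             # u_1, u_2, u_3 in R^2
z = [rng.normal(size=2), rng.normal(size=2)]
print(f(z, anchors, U), rkhs_norm_sq(anchors, U))
```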
III. HUBER MULTI-VIEW REGRESSION WITH MANIFOLD REGULARIZATION

Once the hypothesis space $\mathcal{H}$ is defined according to Section II, we can perform the optimization to learn the regression map. To this aim, we consider a training set $\mathcal{D}$ made of $\ell$ labelled instances $(x_1, y_1), \dots, (x_\ell, y_\ell) \in \mathcal{X} \times \mathbb{R}$ and $u$ additional unlabelled inputs $x_{\ell+1}, \dots, x_{\ell+u} \in \mathcal{X}$. Among the class of functions (2), similarly to [3], [11], the regression map is found by minimizing the following objective:

$$J_{\lambda,\gamma}(f) = \frac{1}{\ell}\sum_{i=1}^{\ell} H_\xi\!\left(y_i - c^\top f(x_i)\right) + \lambda \|f\|_K^2 + \gamma \|f\|_M^2. \tag{3}$$

In equation (3), $\frac{1}{\ell}\sum_{i=1}^{\ell} H_\xi(y_i - c^\top f(x_i))$ represents the empirical risk and is defined by means of the Huber loss (1): it measures how well $c^\top f(x_i)$ predicts the ground-truth output $y_i$. For $\lambda > 0$, to avoid either under-fitting or over-fitting, $\|f\|_K^2$ is a Tikhonov regularizer which controls the complexity of the solution. Instead, $\gamma > 0$ weighs

$$\|f\|_M^2 = \sum_{\alpha=1}^{m}\sum_{i,j=1}^{u+\ell} f^\alpha(x_i^\alpha)\, M_{ij}^\alpha\, f^\alpha(x_j^\alpha), \tag{4}$$

which encodes the geometrical information of the feature space. The family $\{M^1, \dots, M^m\}$ of symmetric and positive semi-definite matrices $M^\alpha \in \mathbb{R}^{(u+\ell)\times(u+\ell)}$ specifies $\|f\|_M^2$ and ensures its non-negativeness. This setting generalizes [1]–[3], where $M^1 = \cdots = M^m = L$, $L$ being the graph Laplacian related to the $(u+\ell)\times(u+\ell)$ adjacency matrix $W$. The latter captures the interdependencies between $f(x_1), \dots, f(x_{u+\ell})$ by measuring their mutual similarity. In this simplified setting, the regularizer rewrites as $\sum_{i,j=1}^{u+\ell} w_{ij}\, \|f(x_i) - f(x_j)\|_2^2$, where $w_{ij}$ denotes the $(i,j)$-th entry of $W$. In this particular case, we retrieve the idea of manifold regularization [3], which enforces nearby patterns in the high-dimensional RKHS to share similar labels, so imposing further regularity on the solution. Conversely, with our generalization, we can consider different graph Laplacians $L^1, \dots, L^m$ and adjacency matrices $W^1, \dots, W^m$.
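To make the interplay of the three terms in (3)–(4) concrete, the brief sketch below evaluates the empirical Huber risk, accepts a precomputed Tikhonov term, and builds the manifold penalty with one graph Laplacian per view. It is again illustrative: the Laplacian construction from a generic adjacency matrix, the toy data and all names are our own choices, not the authors' code.

```python
import numpy as np

def huber(r, xi):
    # Huber loss of Eq. (1), applied elementwise to residuals r.
    r = np.abs(np.asarray(r, dtype=float))
    return np.where(r <= xi, 0.5 * r ** 2, xi * r - 0.5 * xi ** 2)

def manifold_penalty(F_views, W_views):
    """||f||_M^2 of Eq. (4) with M^alpha chosen as the graph Laplacian L^alpha = D^alpha - W^alpha.

    F_views[a] holds f^a(x_1), ..., f^a(x_{l+u}); W_views[a] is the (l+u) x (l+u) adjacency matrix."""
    total = 0.0
    for f_a, W_a in zip(F_views, W_views):
        L_a = np.diag(W_a.sum(axis=1)) - W_a
        total += f_a @ L_a @ f_a  # equals 0.5 * sum_{i,j} w_ij (f_a[i] - f_a[j])^2 for symmetric W_a
    return float(total)

def objective(y, preds, norm_K_sq, F_views, W_views, xi, lam, gamma):
    # J_{lam,gamma}(f) of Eq. (3): empirical Huber risk + Tikhonov term + manifold term.
    risk = np.mean(huber(np.asarray(y) - np.asarray(preds), xi))
    return risk + lam * norm_K_sq + gamma * manifold_penalty(F_views, W_views)

# Toy usage: l = 3 labelled points, u = 2 unlabelled, m = 2 views.
rng = np.random.default_rng(1)
y, preds = rng.normal(size=3), rng.normal(size=3)    # labels y_i and predictions c^T f(x_i)
F_views = [rng.normal(size=5), rng.normal(size=5)]   # f^alpha evaluated on all l + u points
W_views = [np.ones((5, 5)) - np.eye(5)] * 2          # fully connected similarity graphs
print(objective(y, preds, norm_K_sq=0.3, F_views=F_views, W_views=W_views,
                xi=1.0, lam=0.1, gamma=0.01))
```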