Correlated Logistic Model with Elastic Net Regularization for Multilabel Image Classification

Qiang Li, Bo Xie, Jane You, Wei Bian, Member, IEEE, and Dacheng Tao, Fellow, IEEE
Abstract—In this paper, we present the correlated logistic model (CorrLog) for multilabel image classification. CorrLog extends the conventional logistic regression model to multilabel cases by explicitly modelling the pairwise correlations between labels. In addition, we propose to learn the model parameters of CorrLog with elastic net regularization, which helps exploit the sparsity in feature selection and label correlations and thus further boosts the performance of multilabel classification. CorrLog can be efficiently learned, though approximately, by regularized maximum pseudo likelihood estimation (MPLE), and it enjoys a satisfying generalization bound that is independent of the number of labels. CorrLog performs competitively for multilabel image classification on the benchmark datasets MULAN scene, MIT outdoor scene, PASCAL VOC 2007 and PASCAL VOC 2012, compared to state-of-the-art multilabel classification algorithms.

Index Terms—Correlated logistic model, elastic net, multilabel classification.

This research was supported in part by Australian Research Council Projects FT-130101457, DP-140102164 and LE-140100061. The funding support from the Hong Kong government under its General Research Fund (GRF) scheme (Ref. no. 152202/14E) and the Hong Kong Polytechnic University Central Research Grant is greatly appreciated.
Q. Li is with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology Sydney, 81 Broadway, Ultimo, NSW 2007, Australia, and also with the Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong (e-mail: [email protected]).
W. Bian and D. Tao are with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology Sydney, 81 Broadway, Ultimo, NSW 2007, Australia (e-mail: [email protected], [email protected]).
B. Xie is with the College of Computing, Georgia Institute of Technology, Atlanta, GA 30345, USA (e-mail: [email protected]).
J. You is with the Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong (e-mail: [email protected]).

I. INTRODUCTION

Multilabel classification (MLC) extends conventional single-label classification (SLC) by allowing an instance to be assigned multiple labels from a label set. It arises naturally in a wide range of practical problems, such as document categorization, image classification, music annotation, web-page classification and bioinformatics applications, where each instance can be simultaneously described by several class labels out of a candidate label set. MLC is also closely related to many other research areas, such as subspace learning [1], nonnegative matrix factorization [2], multi-view learning [3] and multi-task learning [4]. Because of its great generality and wide applicability, MLC has received increasing attention in recent years from the machine learning, data mining and computer vision communities, and has developed rapidly with both algorithmic and theoretical achievements [5], [6], [7], [8], [9], [10].

The key feature of MLC that makes it distinct from SLC is label correlation, without which classifiers can be trained independently for each individual label and MLC degenerates to SLC. The correlation between different labels can be verified by calculating statistics of their distributions, e.g., the χ² test or Pearson's correlation coefficient. According to [11], there are two types of label correlations (or dependence), i.e., conditional correlations and unconditional correlations: the former describes label correlations conditioned on a given instance, while the latter summarizes global correlations of the label distribution alone, obtained by marginalizing out the instance. From a classification point of view, modelling conditional label correlations is preferable since they are directly related to prediction; proper utilization of unconditional correlations is also helpful, but only in an average sense because of the marginalization. Accordingly, quite a number of MLC algorithms have been proposed in the past few years¹ by exploiting either of the two types of label correlations, and below we give a brief review of representative ones. As the literature is very large, we cannot cover all algorithms; the recent surveys [8], [9] contain many references omitted from this paper.

¹Studies on MLC from perspectives other than label correlations also exist in the literature, e.g., those defining different loss functions, dimension reduction and classifier ensemble methods, but they are not in the scope of this paper.

• By exploiting unconditional label correlations: A large class of MLC algorithms that utilize unconditional label correlations are built upon label transformation.
The key idea is to find a new representation for the label vector (one dimension corresponds to an individual label), so that the transformed labels or responses are uncorrelated and thus can be predicted independently; the original label vector is then recovered after prediction. MLC algorithms using label transformation include [12], which utilizes low-dimensional embedding, and [7] and [13], which use random projections. Another strategy for using unconditional label correlations, e.g., as used in the stacking method [6] and the "Curds and Whey" procedure [14], is to first predict each individual label independently and then correct/adjust the predictions by proper post-processing. Algorithms have also been proposed based on co-occurrence or structure information extracted from the label set,
which include random k-label sets (RAKEL) [15], pruned problem transformation (PPT) [16], hierarchical binary relevance (HBR) [17] and the hierarchy of multilabel classifiers (HOMER) [8]. Regression-based models, including reduced-rank regression and multitask learning, can also be used for MLC, with an interpretation of utilizing unconditional label correlations [11].

• By exploiting conditional label correlations: MLC algorithms in this category are diverse and often developed from specific heuristics. For example, multilabel K-nearest neighbour (MLkNN) [5] extends kNN to the multilabel setting, applying maximum a posteriori (MAP) label prediction with a prior label distribution obtained from the K nearest neighbours of an instance. Instance-based logistic regression (IBLR) [6] is also a localized algorithm, which modifies logistic regression by using label information from the neighbourhood as features. The classifier chain (CC) [18], as well as its ensemble and probabilistic variants [19], incorporates label correlations into a chain of binary classifiers, where the prediction of a label uses the previous labels as features. Channel coding based MLC techniques such as principal label space transformation (PLST) [20] and maximum margin output coding (MMOC) [21] select codes that exploit conditional label correlations. Graphical models, e.g., conditional random fields (CRFs) [22], have also been applied to MLC and provide a richer framework for handling conditional label correlations.

A. Multilabel Image Classification

Multilabel image classification belongs to the generic scope of MLC, but handles the specific problem of predicting the presence or absence of multiple object categories in an image.
Like many related high-level vision tasks, such as object recognition [23], [24], visual tracking [25], image annotation [26], [27], [28] and scene classification [29], [30], [31], multilabel image classification [32], [33], [34], [35], [36], [37] is very challenging due to large intra-class variation. In general, the variation is caused by viewpoint, scale, occlusion, illumination, semantic context, etc.

On the one hand, many effective image representation schemes have been developed to handle this high-level vision task. Most classical approaches derive from handcrafted image features, such as GIST [38], dense SIFT [39], VLAD [40] and object bank [41]. In contrast, very recent deep learning techniques have also been developed for image feature learning, such as deep CNN features [42], [43]. These techniques are more powerful than classical methods when learning from a very large amount of unlabeled images.

On the other hand, label correlations have also been exploited to significantly improve image classification performance. Most current multilabel image classification algorithms are motivated by considering label correlations conditioned on image features, and thus intrinsically fall into the CRFs framework. For example, the probabilistic label enhancement model (PLEM) [44] exploits image label co-occurrence pairs based on a maximum spanning tree construction, and a piecewise procedure is utilized to train the pairwise CRFs model. More recently, the clique generating machine (CGM) [45] proposed to learn the image label graph structure and parameters by iteratively activating a set of cliques. It also belongs to the CRFs framework, but the labels are not constrained to be all connected, which may result in isolated cliques.

B. Motivation and Organization

The correlated logistic model (CorrLog) provides a more principled way to handle conditional label correlations and enjoys several favourable properties: 1) built upon independent logistic regressions (ILRs), it offers an explicit way to model pairwise (second order) label correlations; 2) by using the pseudo likelihood technique, the parameters of CorrLog can be learned approximately with a computational complexity linear with respect to the label number; 3) the learning of CorrLog is stable, and the empirically learned model enjoys a generalization error bound that is independent of the label number. In addition, the results presented in this paper extend our previous study [46] in the following aspects: 1) we introduce elastic net regularization to CorrLog, which facilitates the utilization of sparsity in both feature selection and label correlations; 2) a learning algorithm for CorrLog based on soft thresholding is derived to handle the nonsmoothness of the elastic net regularization; 3) the proof of the generalization bound is extended to the new regularization; 4) we apply CorrLog to multilabel image classification, and achieve results competitive with the state-of-the-art methods in this area.

To ease the presentation, we first summarize the important notations in Table I. The rest of this paper is organized as follows. Section II introduces the CorrLog model with elastic net regularization. Section III presents algorithms for learning CorrLog by regularized maximum pseudo likelihood estimation, and for prediction with CorrLog by message passing. A generalization analysis of CorrLog based on the concept of algorithmic stability is presented in Section IV. Section V to Section VII report results of empirical evaluations, including experiments on a synthetic dataset and on benchmark multilabel image classification datasets.

II. CORRELATED LOGISTIC MODEL

We study the problem of learning a joint prediction y = d(x), with d : X → Y, where the instance space X = {x ∈ R^D : ‖x‖ ≤ 1} and the label space Y = {−1, 1}^m. By assuming conditional independence among the labels, we can model MLC by a set of independent logistic regressions (ILRs). Specifically, the conditional probability p_lr(y|x) of ILRs is given by

    p_{lr}(y|x) = \prod_{i=1}^m p_{lr}(y_i|x) = \prod_{i=1}^m \frac{\exp(y_i \beta_i^T x)}{\exp(\beta_i^T x) + \exp(-\beta_i^T x)},    (1)

where β_i ∈ R^D are the coefficients of the i-th logistic regression (LR) in ILRs. For the convenience of expression,
TABLE I
SUMMARY OF IMPORTANT NOTATIONS THROUGHOUT THIS PAPER.

    Notation                 Description
    D = {x^(l), y^(l)}       training dataset with n examples, 1 ≤ l ≤ n

In this paper, we choose q(y) to be the following quadratic form,

    q(y) = \exp\Big( \sum_{i<j} \alpha_{ij} y_i y_j \Big).    (3)
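To make the roles of the per-label linear terms in (1) and the pairwise term q(y) in (3) concrete, the following minimal Python sketch evaluates the unnormalized joint score exp(Σ_i y_i β_i^T x + Σ_{i<j} α_ij y_i y_j) for a candidate labelling. The function and variable names (corrlog_score, beta, alpha) are our own illustrative choices, not notation from the paper, and normalization over all 2^m labellings is omitted.

    import numpy as np

    def corrlog_score(y, x, beta, alpha):
        """Unnormalized CorrLog score for one labelling y in {-1, +1}^m.

        beta:  (m, D) array of per-label coefficients (unary terms, cf. Eq. (1)).
        alpha: (m, m) array; only the upper triangle (i < j) is used (cf. Eq. (3)).
        """
        unary = np.sum(y * (beta @ x))              # sum_i y_i * beta_i^T x
        iu, ju = np.triu_indices(len(y), k=1)       # index pairs with i < j
        pairwise = np.sum(alpha[iu, ju] * y[iu] * y[ju])
        return np.exp(unary + pairwise)

    # Tiny usage example with random parameters (illustrative only).
    rng = np.random.default_rng(0)
    m, D = 4, 8
    x = rng.normal(size=D)
    beta = rng.normal(size=(m, D))
    alpha = np.triu(rng.normal(scale=0.1, size=(m, m)), k=1)
    y = np.array([1, -1, 1, 1])
    print(corrlog_score(y, x, beta, alpha))

Setting all α_ij to zero reduces this score to the ILR model (1) up to normalization, which is one way to see that CorrLog strictly generalizes independent logistic regressions.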
It has been shown in previous studies that ℓ1 regularized algorithms are inherently unstable, that is, a slight change of the training data set can lead to substantially different prediction models. Based on the above consideration, we propose to use the elastic net regularizer [49], which is a combination of ℓ2 and ℓ1 regularizers and inherits their individual advantages, i.e., learning stability and model sparsity,

    R_{en}(\Theta; \lambda_1, \lambda_2, \varepsilon) = \lambda_1 \sum_{i=1}^m \big( \|\beta_i\|_2^2 + \varepsilon \|\beta_i\|_1 \big) + \lambda_2 \sum_{i<j} \big( \alpha_{ij}^2 + \varepsilon |\alpha_{ij}| \big).    (8)

A. Approximate Learning via Pseudo Likelihood

Maximum pseudo likelihood estimation (MPLE) [50] provides an alternative approach for estimating model parameters, especially when the partition function of the likelihood cannot be evaluated efficiently. It was developed in the field of spatial dependence analysis and has been widely applied to the estimation of various statistical models, from the Ising model [47] to CRFs [51]. Here, we apply MPLE to the learning of the parameters Θ of CorrLog.

The pseudo likelihood of a model over m jointly distributed random variables is defined as the product of the conditional probabilities of each individual random variable conditioned on all the rest. For CorrLog (4), its pseudo likelihood is given by

    \tilde{p}(y|x; \Theta) = \prod_{i=1}^m p(y_i | y_{-i}, x; \Theta),    (10)

where y_{-i} = [y_1, ..., y_{i-1}, y_{i+1}, ..., y_m] and the conditional probability is

    p(y_i | y_{-i}, x; \Theta) = \frac{1}{1 + \exp\big\{ -2 y_i \big( \beta_i^T x + \sum_{j=i+1}^m \alpha_{ij} y_j + \sum_{j=1}^{i-1} \alpha_{ji} y_j \big) \big\}}.    (11)

Accordingly, the negative log pseudo likelihood over the training data D is given by

    \tilde{L}(\Theta) = -\frac{1}{n} \sum_{l=1}^n \sum_{i=1}^m \log p\big( y_i^{(l)} | y_{-i}^{(l)}, x^{(l)}; \Theta \big).    (12)

To this end, the optimal model parameters Θ̃ = {β̃, α̃} of CorrLog can be learned approximately by the elastic net regularized MPLE,

    \tilde{\Theta} = \arg\min_\Theta \tilde{L}_r(\Theta),    (13)

where L̃_r(Θ) denotes the regularized objective, i.e., the negative log pseudo likelihood (12) plus the elastic net regularizer (8). Let J_s(Θ) denote the smooth part of L̃_r(Θ), i.e., the negative log pseudo likelihood together with the ℓ2 terms of (8); its partial gradients at the current estimate Θ^(k) are

    \nabla_{\beta_i} J_s(\Theta^{(k)}) = \frac{1}{n} \sum_{l=1}^n \xi_{li}\, x^{(l)} + 2\lambda_1 \beta_i^{(k)},
    \nabla_{\alpha_{ij}} J_s(\Theta^{(k)}) = \frac{1}{n} \sum_{l=1}^n \big( \xi_{li}\, y_j^{(l)} + \xi_{lj}\, y_i^{(l)} \big) + 2\lambda_2 \alpha_{ij}^{(k)},    (15)

with

    \xi_{li} = \frac{ -2 y_i^{(l)} }{ 1 + \exp\big\{ 2 y_i^{(l)} \big( \beta_i^{(k)T} x^{(l)} + \sum_{j=i+1}^m \alpha_{ij}^{(k)} y_j^{(l)} + \sum_{j=1}^{i-1} \alpha_{ji}^{(k)} y_j^{(l)} \big) \big\} }.    (16)
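As a concreteness check on (15)–(16), the short Python sketch below computes ξ_li and the two partial gradients of the smooth part J_s over a dataset. The helper names (xi_values, smooth_gradients) are our own illustrative choices, not notation from the paper, and the ℓ1 part of the regularizer is deliberately left out here because it is handled separately by the soft thresholding step described next.

    import numpy as np

    def xi_values(x, y, beta, alpha):
        """xi_{li} of Eq. (16) for a single example (x, y), for all labels i at once."""
        A = np.triu(alpha, k=1)                       # alpha_ij, i < j
        a = beta @ x + A @ y + A.T @ y                # beta_i^T x + pairwise field, per label i
        return -2.0 * y / (1.0 + np.exp(2.0 * y * a))

    def smooth_gradients(X, Y, beta, alpha, lam1, lam2):
        """Partial gradients (15) of the smooth objective J_s (pseudo likelihood + l2 terms)."""
        n, m = Y.shape
        g_beta = np.zeros_like(beta)
        g_alpha = np.zeros_like(alpha)
        for x, y in zip(X, Y):
            xi = xi_values(x, y, beta, alpha)
            g_beta += np.outer(xi, x) / n             # (1/n) sum_l xi_{li} x^{(l)}
            g_alpha += (np.outer(xi, y) + np.outer(y, xi)) / n   # xi_{li} y_j + xi_{lj} y_i
        g_beta += 2.0 * lam1 * beta                   # from lambda1 * ||beta_i||_2^2
        g_alpha = np.triu(g_alpha, k=1) + 2.0 * lam2 * np.triu(alpha, k=1)  # keep i < j only
        return g_beta, g_alpha

Note that each gradient touches every training example once, so the cost per label is the same as for an ordinary logistic regression, consistent with the linear complexity discussed in Remark 1 below.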
Then, a surrogate J(Θ) of the objective function L̃_r(Θ) in (13) can be obtained by using ∇J_s(Θ^(k)), i.e.,

    J(\Theta; \Theta^{(k)}) = J_s(\Theta^{(k)})
      + \sum_{i=1}^m \Big( \langle \nabla_{\beta_i} J_s(\Theta^{(k)}),\, \beta_i - \beta_i^{(k)} \rangle + \frac{1}{2\eta} \|\beta_i - \beta_i^{(k)}\|_2^2 + \lambda_1 \|\beta_i\|_1 \Big)
      + \sum_{i<j} \Big( \langle \nabla_{\alpha_{ij}} J_s(\Theta^{(k)}),\, \alpha_{ij} - \alpha_{ij}^{(k)} \rangle + \frac{1}{2\eta} (\alpha_{ij} - \alpha_{ij}^{(k)})^2 + \lambda_2 |\alpha_{ij}| \Big).    (17)

The parameter η in (17) serves a role similar to the step size in gradient descent methods, and it is set such that 1/η is larger than the Lipschitz constant of ∇J_s(Θ^(k)). For such η, it can be shown that J(Θ) ≥ L̃_r(Θ) and J(Θ^(k)) = L̃_r(Θ^(k)). Therefore, the update of Θ can be realized by the minimization

    \Theta^{(k+1)} = \arg\min_\Theta J(\Theta; \Theta^{(k)}),    (18)

which is solved by the soft thresholding function S(·), i.e.,

    \beta_i^{(k+1)} = S\big( \beta_i^{(k)} - \eta\, \nabla_{\beta_i} J_s(\Theta^{(k)});\, \lambda_1 \big),
    \alpha_{ij}^{(k+1)} = S\big( \alpha_{ij}^{(k)} - \eta\, \nabla_{\alpha_{ij}} J_s(\Theta^{(k)});\, \lambda_2 \big),    (19)

where

    S(u; \rho) = \begin{cases} u - 0.5\rho, & \text{if } u > 0.5\rho \\ u + 0.5\rho, & \text{if } u < -0.5\rho \\ 0, & \text{otherwise.} \end{cases}    (20)

Iteratively applying (19) until convergence provides a first-order method for solving (13). Algorithm 1 presents the pseudo code for this procedure.

Algorithm 1 Learning CorrLog by Maximum Pseudo Likelihood Estimation with Elastic Net Regularization
    Input: Training data D, initialization β^(0) = 0, α^(0) = 0, and learning rate η, where 1/η is set larger than the Lipschitz constant of ∇J_s(Θ) in (17).
    Output: Model parameters Θ̃ = (β̃^(t), α̃^(t)).
    repeat
        update β^(k+1) and α^(k+1) by the soft thresholding steps in (19)
    until convergence
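The update (19)–(20) has the familiar form of a proximal gradient (ISTA-style) step: a gradient step on the smooth part J_s followed by soft thresholding. A minimal Python sketch of one such iteration is given below; it reuses the hypothetical smooth_gradients helper from the previous sketch, the thresholds follow the form shown in (19), and the function names are ours rather than the paper's.

    import numpy as np

    def soft_threshold(u, rho):
        """Elementwise S(u; rho) of Eq. (20)."""
        return np.sign(u) * np.maximum(np.abs(u) - 0.5 * rho, 0.0)

    def corrlog_update(beta, alpha, X, Y, eta, lam1, lam2):
        """One iteration of (19): gradient step on J_s, then soft thresholding."""
        g_beta, g_alpha = smooth_gradients(X, Y, beta, alpha, lam1, lam2)  # Eq. (15)
        beta_new = soft_threshold(beta - eta * g_beta, lam1)
        alpha_new = soft_threshold(alpha - eta * g_alpha, lam2)
        return beta_new, np.triu(alpha_new, k=1)      # keep only the i < j entries

    # Iterating corrlog_update until the parameters stop changing implements Algorithm 1.

The soft thresholding sets small coordinates of β and α exactly to zero, which is how the ℓ1 part of the elastic net produces sparse feature weights and sparse label correlations.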
Remark 1. From the above derivation, especially equations (15) and (19), the computational complexity of our learning algorithm is linear with respect to the label number m. Therefore, learning CorrLog is no more expensive than learning m independent logistic regressions, which makes CorrLog scalable to the case of large label numbers.

Remark 2. It is possible to further speed up the learning algorithm. In particular, Algorithm 1 can be modified to have the optimal convergence rate in the sense of Nemirovsky

Given the model parameters Θ̃ learned in the last subsection, the prediction ŷ for a new instance x can be obtained by

    \hat{y} = \arg\max_{y \in \mathcal{Y}} p(y|x; \tilde{\Theta}) = \arg\max_{y \in \mathcal{Y}} \exp\Big( \sum_{i=1}^m y_i \tilde{\beta}_i^T x + \sum_{i<j} \tilde{\alpha}_{ij} y_i y_j \Big).    (21)

IV. GENERALIZATION ANALYSIS

An important issue in designing a machine learning algorithm is generalization, i.e., how the algorithm will perform on test data compared to on the training data. In this section, we present a generalization analysis for CorrLog by using the concept of algorithmic stability [55]. Our analysis follows two steps. First, we show that the learning of CorrLog by MPLE is stable, i.e., the learned model parameter Θ̃ does not vary much given a slight change of the training data set D. Then, we prove that the generalization error of CorrLog can be bounded by the empirical error, plus a term related to the stability but independent of the label number m.

A. The Stability of MPLE

The stability of a learning algorithm indicates how much the learned model changes according to a small change of the training data set. Denote by D^k a modified training data set that is the same as D but with the k-th training example (x^(k), y^(k)) replaced by another independent example (x′, y′). Suppose Θ̃ and Θ̃^k are the model parameters learned by MPLE (13) on D and D^k, respectively. We intend to show that the difference between these two models, defined as

    \|\tilde{\Theta}^k - \tilde{\Theta}\| \triangleq \sum_{i=1}^m \|\tilde{\beta}_i^k - \tilde{\beta}_i\| + \sum_{i<j} |\tilde{\alpha}_{ij}^k - \tilde{\alpha}_{ij}|, \quad \forall\, 1 \le k \le n,

Lemma 1. Given L̃_r(·) and Θ̃ defined in (13), and Θ̃^{\k} defined in (23), it holds for ∀ 1 ≤ k ≤ n,

    \tilde{L}_r(\tilde{\Theta}^{\setminus k}) - \tilde{L}_r(\tilde{\Theta}) \le \frac{1}{n}\Big( \sum_{i=1}^m \log p\big( y_i^{(k)} | y_{-i}^{(k)}, x^{(k)}; \tilde{\Theta}^{\setminus k} \big) - \sum_{i=1}^m \log p\big( y_i^{(k)} | y_{-i}^{(k)}, x^{(k)}; \tilde{\Theta} \big) \Big).    (25)

Proof. Denote by RHS the right-hand side of (25); we have

    \mathrm{RHS} = \big( \tilde{L}_r(\tilde{\Theta}^{\setminus k}) - \tilde{L}_r^{\setminus k}(\tilde{\Theta}^{\setminus k}) \big) - \big( \tilde{L}_r(\tilde{\Theta}) - \tilde{L}_r^{\setminus k}(\tilde{\Theta}) \big).

Furthermore, the definition of Θ̃^{\k} implies L̃_r^{\k}(Θ̃^{\k}) ≤ L̃_r^{\k}(Θ̃), which, combined with the identity above, gives (25).

Lemma 2. With Θ̃ and Θ̃^{\k} as above, it holds that

    \tilde{L}_r(\tilde{\Theta}^{\setminus k}) - \tilde{L}_r(\tilde{\Theta}) \ge \lambda_1 \|\tilde{\beta}^{\setminus k} - \tilde{\beta}\|^2 + \lambda_2 \|\tilde{\alpha}^{\setminus k} - \tilde{\alpha}\|^2.    (26)

Proof. We define the following function

    f(\Theta) = \tilde{L}_r(\Theta) - \lambda_1 \|\beta - \tilde{\beta}\|^2 - \lambda_2 \|\alpha - \tilde{\alpha}\|^2.

Then, for (26), it is sufficient to show that f(Θ̃^{\k}) ≥ f(Θ̃). Since Θ̃ minimizes L̃_r(Θ), we have 0 ∈ ∂L̃_r(Θ̃) and thus 0 ∈ ∂f(Θ̃), which implies that Θ̃ also minimizes f(Θ). Therefore f(Θ̃) ≤ f(Θ̃^{\k}).

In addition, by checking the Lipschitz continuity of log p(y_i|y_{-i}, x; Θ), we have the following Lemma 3.

Lemma 3. Given Θ̃ defined in (13) and Θ̃^{\k} defined in (23), it holds for ∀ (x, y) ∈ X × Y and ∀ 1 ≤ k ≤ n,

    \sum_{i=1}^m \log p(y_i | y_{-i}, x; \tilde{\Theta}) - \sum_{i=1}^m \log p(y_i | y_{-i}, x; \tilde{\Theta}^{\setminus k}) \le 2 \sum_{i=1}^m \|\tilde{\beta}_i - \tilde{\beta}_i^{\setminus k}\| + 4 \sum_{i<j} |\tilde{\alpha}_{ij} - \tilde{\alpha}_{ij}^{\setminus k}|.    (29)

That is, log p(y_i|y_{-i}, x; Θ) is Lipschitz continuous with respect to β_i and α_ij, with constants 2 and 4, respectively. Therefore, (29) holds.

By combining the above three Lemmas, we have the following Theorem 1 that shows the stability of CorrLog.

Theorem 1. Given model parameters Θ̃ = {β̃, α̃} and Θ̃^k = {β̃^k, α̃^k} learned on training datasets D and D^k, respectively, both by (13), it holds that

    \sum_{i=1}^m \|\tilde{\beta}_i^k - \tilde{\beta}_i\| + \sum_{i<j} |\tilde{\alpha}_{ij}^k - \tilde{\alpha}_{ij}| \le \frac{16}{\min(\lambda_1, \lambda_2)\, n}.    (30)

Further, by using

    \|\tilde{\beta}^{\setminus k} - \tilde{\beta}\|_2^2 + \|\tilde{\alpha}^{\setminus k} - \tilde{\alpha}\|_2^2 \ge \frac{1}{2} \Big( \sum_{i=1}^m \|\tilde{\beta}_i - \tilde{\beta}_i^{\setminus k}\| + \sum_{i<j} |\tilde{\alpha}_{ij} - \tilde{\alpha}_{ij}^{\setminus k}| \Big)    (32)

B. Generalization Bound

We first define a loss function to measure the generalization error. Considering that CorrLog predicts labels by MAP estimation, we define the loss function by using the log probability:

    \ell(x, y; \Theta) = \begin{cases} 1, & f(x, y, \Theta) < 0 \\ 1 - f(x, y, \Theta)/\gamma, & 0 \le f(x, y, \Theta) < \gamma \\ 0, & f(x, y, \Theta) \ge \gamma, \end{cases}    (35)

where the constant γ > 0 and

    f(x, y, \Theta) = \log p(y|x; \Theta) - \max_{y' \ne y} \log p(y'|x; \Theta).

The loss function (35) is defined analogously to the loss function used in binary classification, where f(x, y, Θ) is replaced by the margin y w^T x if a linear classifier w is used. Besides, (35) gives a 0 loss only if all dimensions of y are correctly predicted, which emphasizes the joint prediction in MLC. By using this loss function, the generalization error and the empirical error are given by

    R(\tilde{\Theta}) = \mathbb{E}_{xy}\, \ell(x, y; \tilde{\Theta}),    (37)

and

    \tilde{R}(\tilde{\Theta}) = \frac{1}{n} \sum_{l=1}^n \ell(x^{(l)}, y^{(l)}; \tilde{\Theta}).    (38)

According to [55], an exponential bound exists for R(Θ̃) if CorrLog has a uniform stability with respect to the loss function (35). The following Theorem 2 shows that this condition holds.

Theorem 2. Given model parameters Θ̃ = {β̃, α̃} and Θ̃^k = {β̃^k, α̃^k} learned on training datasets D and D^k, respectively, both by (13), it holds for ∀ (x, y) ∈ X × Y,

    |\ell(x, y; \tilde{\Theta}) - \ell(x, y; \tilde{\Theta}^k)| \le \frac{32}{\gamma \min(\lambda_1, \lambda_2)\, n}.    (39)

Proof. First, we have the following inequality from (35),

    \gamma\, |\ell(x, y; \tilde{\Theta}) - \ell(x, y; \tilde{\Theta}^k)| \le |f(x, y, \tilde{\Theta}) - f(x, y, \tilde{\Theta}^k)|.    (40)

Then, by introducing the notation

    A(x, y, \beta, \alpha) = \sum_{i=1}^m y_i \beta_i^T x + \sum_{i<j} \alpha_{ij} y_i y_j,    (41)

and bounding |f(x, y, Θ̃) − f(x, y, Θ̃^k)| in terms of the parameter differences, the proof is completed by applying Theorem 1.

Now, we are ready to present the main theorem on the generalization ability of CorrLog.

Theorem 3. Given the model parameter Θ̃ learned by (13) with i.i.d. training data D = {(x^(l), y^(l)) ∈ X × Y, l = 1, 2, ..., n} and regularization parameters λ1, λ2, it holds with probability at least 1 − δ that

    R(\tilde{\Theta}) \le \tilde{R}(\tilde{\Theta}) + \frac{32}{\gamma \min(\lambda_1, \lambda_2)\, n} + \Big( \frac{64}{\gamma \min(\lambda_1, \lambda_2)} + 1 \Big) \sqrt{\frac{\log(1/\delta)}{2n}}.    (46)

Proof. Given Theorem 2, the generalization bound (46) is a direct result of Theorem 12 in [55] (please refer to the reference for details).

Remark 3. A notable observation from Theorem 3 is that the generalization bound (46) of CorrLog is independent of the label number m. Therefore, CorrLog is preferable for MLC with a large number of labels, for which the generalization error can still be bounded with high confidence.

Remark 4. While the learning of CorrLog (13) utilizes the elastic net regularization R_en(Θ; λ1, λ2, ε), where ε is the weighting parameter on the ℓ1 regularization that encourages sparsity, the generalization bound (46) is independent of the parameter ε. The reason is that ℓ1 regularization does not lead to stable learning algorithms [56], and only the ℓ2 regularization in R_en(Θ; λ1, λ2, ε) contributes to the stability of CorrLog.
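As an illustration of how the quantities in (21) and (35)–(38) can be evaluated, the brute-force Python sketch below enumerates all 2^m labellings to obtain the MAP prediction and the margin f(x, y, Θ), and then averages the ramp loss over a dataset to obtain the empirical error R̃(Θ̃). This exhaustive enumeration is only practical for small m (the paper itself uses message passing for prediction), and all function names here are illustrative rather than taken from the paper.

    import itertools
    import numpy as np

    def log_score(y, x, beta, alpha):
        """Unnormalized log probability: sum_i y_i beta_i^T x + sum_{i<j} alpha_ij y_i y_j."""
        iu, ju = np.triu_indices(len(y), k=1)
        return float(np.sum(y * (beta @ x)) + np.sum(alpha[iu, ju] * y[iu] * y[ju]))

    def map_predict(x, beta, alpha):
        """Eq. (21) by exhaustive search over {-1, +1}^m (small m only)."""
        m = beta.shape[0]
        candidates = [np.array(c) for c in itertools.product([-1, 1], repeat=m)]
        return max(candidates, key=lambda y: log_score(y, x, beta, alpha))

    def margin(x, y, beta, alpha):
        """f(x, y, Theta): the partition function cancels in the difference of log scores."""
        m = len(y)
        best_other = max(log_score(np.array(c), x, beta, alpha)
                         for c in itertools.product([-1, 1], repeat=m)
                         if not np.array_equal(np.array(c), y))
        return log_score(y, x, beta, alpha) - best_other

    def ramp_loss(x, y, beta, alpha, gamma):
        """Eq. (35)."""
        f = margin(x, y, beta, alpha)
        return 1.0 if f < 0 else (1.0 - f / gamma if f < gamma else 0.0)

    def empirical_error(X, Y, beta, alpha, gamma):
        """Eq. (38): average ramp loss over the training data."""
        return np.mean([ramp_loss(x, y, beta, alpha, gamma) for x, y in zip(X, Y)])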