Least Absolute Deviations Learning of Multiple Tasks
Journal of Industrial and Management Optimization, Volume 14, Number 2, April 2018, pp. 719-729. doi:10.3934/jimo.2017071

Wei Xue (1,2), Wensheng Zhang (2,3) and Gaohang Yu (4)

(1) School of Computer Science and Technology, Anhui University of Technology, Maanshan 243032, China
(2) School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
(3) Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
(4) School of Mathematics and Computer Sciences, Gannan Normal University, Ganzhou 341000, China

(Communicated by Cheng-Chew Lim)

Abstract. In this paper, we propose a new multitask feature selection model based on least absolute deviations. However, due to the inherent nonsmoothness of the $\ell_1$ norm, optimizing this model is challenging. To tackle the problem efficiently, we introduce an alternating iterative optimization algorithm and, under some mild conditions, establish its global convergence. Experimental results and a comparison with the state-of-the-art algorithm SLEP show the efficiency and effectiveness of the proposed approach in solving multitask learning problems.

2010 Mathematics Subject Classification. Primary: 49M27, 90C25; Secondary: 65K05.
Key words and phrases. Multitask learning, feature selection, least absolute deviations, alternating direction method, $\ell_1$ regularization.
The initial version of this work was done while the first author was a Ph.D. candidate at Nanjing University of Science and Technology.

1. Introduction. Multitask feature selection (MTFS), which aims to learn explanatory features across multiple related tasks, has been successfully applied in various applications including character recognition [18], classification [21], medical diagnosis [23], and object tracking [2]. One key assumption behind many MTFS models is that all tasks are interrelated. In this paper, we mainly consider the regression problem: in our setup, there are $k$ regression problems, or "tasks", with all data coming from the same space. For the $j$-th task there are $m_j$ sample points, so the data set is $\mathcal{D} = \bigcup_{j=1}^{k} \mathcal{D}_j$, where $\mathcal{D}_j = \{(x_j^i, y_j^i)\}_{i=1}^{m_j}$ is sampled from an underlying distribution $P_j$ ($P_j$ is usually assumed to be different for each task, but all the $P_j$'s are related [4]), $x_j^i \in \mathbb{R}^n$ denotes the $i$-th training sample for the $j$-th task, $y_j^i \in \mathbb{R}$ denotes the corresponding response, the superscript $i$ indexes the independent and identically distributed (i.i.d.) observations for each task, $m_j$ is the number of samples for the $j$-th task, and the total number of training samples is $m = \sum_{j=1}^{k} m_j$. The goal of MTFS is to learn $k$ decision functions $\{f_j\}_{j=1}^{k}$ such that $f_j(x_j^i)$ approximates $y_j^i$. Typically, in multitask learning models, the decision function $f_j$ for the $j$-th task is assumed to be a hyperplane parameterized by the model weight vector $w_j \in \mathbb{R}^n$. The objective of MTFS models is then to learn a weight matrix $W \in \mathbb{R}^{n \times k}$. To keep the notation uncluttered, we write $W$ in both columnwise and rowwise form, i.e., $W = [w_1, \ldots, w_k] = [w^1; \ldots; w^n]$.
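To make the problem setup concrete, the short NumPy sketch below (an illustration of ours, not taken from the paper; the sizes and array names are arbitrary) builds a toy multitask data set and shows the columnwise and rowwise views of the weight matrix $W$.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 5                        # number of tasks, number of features (toy sizes)
m = [10, 12, 8]                    # m_j: number of samples per task, so m = 30

# D_j = {(x_j^i, y_j^i)}: one sample matrix X_j (m_j x n) and response vector y_j per task
X = [rng.standard_normal((m_j, n)) for m_j in m]
y = [rng.standard_normal(m_j) for m_j in m]

# Weight matrix W in R^{n x k}: column w_j parameterizes the hyperplane of task j,
# row w^i collects the weights of feature i across all k tasks.
W = rng.standard_normal((n, k))
w_col = W[:, 0]                    # columnwise view: w_1, the weights of task 1
w_row = W[0, :]                    # rowwise view: w^1, feature 1 across the tasks
```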
For convenience, let $X_j = [x_j^1, \cdots, x_j^{m_j}]^T \in \mathbb{R}^{m_j \times n}$ denote the sample matrix for the $j$-th task, let $y_j = [y_j^1, \cdots, y_j^{m_j}]^T \in \mathbb{R}^{m_j}$ be the corresponding response vector, and let $L_j(X_j w_j, y_j)$ be a loss function on the sample $(X_j, y_j)$ for task $j$. A standard method for finding a sparse $W$ is to solve the following $\ell_1$ minimization problem:
$$\min_{W \in \mathbb{R}^{n \times k}} \ \frac{1}{k} \sum_{j=1}^{k} \frac{1}{m_j} \sum_{i=1}^{m_j} L_j(X_j w_j, y_j) + \mu \sum_{j=1}^{k} \|w_j\|_1, \qquad (1)$$
where $\mu > 0$ is a regularization parameter used to balance the loss term and the regularization term.

Solving (1) leads to individual sparsity patterns for each $w_j$. Many methods have been proposed to select features globally through variants of $\ell_1$ regularization, or more specifically, by imposing a mixed $\ell_{p,1}$ norm, such as the $\ell_{2,1}$ matrix norm, $\|W\|_{2,1} \triangleq \sum_{i=1}^{n} \|w^i\|_2$, and the $\ell_{\infty,1}$ matrix norm, $\|W\|_{\infty,1} \triangleq \sum_{i=1}^{n} \|w^i\|_\infty = \sum_{i=1}^{n} \max_j |W_{i,j}|$ [6, 10, 12, 17, 19, 22, 25]. As argued in [26], the advantage of these two matrix norms is that they not only benefit from the $\ell_1$ norm, which promotes sparse solutions, but also achieve group sparsity through the $\ell_p$ norm. The $\ell_{2,1}$ norm is essentially the sum of the $\ell_2$ norms of the rows, and the $\ell_{\infty,1}$ norm penalizes the sum of the maximum absolute values of each row. One appealing property of the two matrix norms is that they encourage the predictors of different tasks to share similar parameter sparsity patterns and discover solutions in which only a few features are nonzero. One commonly used choice for $L_j(X_j w_j, y_j)$ is the squared loss, $\|X_j w_j - y_j\|_2^2$, which can be viewed as regression with Gaussian noise [14].

In this paper, we focus on the regression problem in the context of MTFS. Instead of choosing the conventional squared loss, we consider a nonsmooth loss function and derive our MTFS model from a probabilistic perspective. The key contributions of this paper are highlighted as follows.
• We propose a new MTFS model within a probabilistic framework and develop an iterative algorithm to solve it.
• We theoretically provide a convergence result for the developed algorithm.
• We conduct experiments on both synthetic and real data to show the performance of the proposed approach.

The rest of this paper is organized as follows. In Section 2, we introduce the multitask learning formulation and present an optimization algorithm for the proposed model. In Section 3, we provide a theoretical analysis of the algorithm. Experimental comparisons and results are reported in Section 4. Finally, we conclude the paper with future work in Section 5.

2. Model formulation and algorithm. In this section, we first introduce our formulation for multitask learning and then present the solution process.

2.1. Model formulation. Given $x_j \in \mathbb{R}^n$, suppose that the corresponding output $y_j \in \mathbb{R}$ for task $j$ has a Laplacian distribution with location parameter $w_j^T x_j$ and scale parameter $\sigma_j > 0$; that is, its probability density function has the form
$$p(y_j \mid w_j, x_j, \sigma_j) = \frac{1}{2\sigma_j} \exp\left(-\frac{|y_j - w_j^T x_j|}{\sigma_j}\right). \qquad (2)$$
Denote $\sigma = [\sigma_1, \ldots, \sigma_k] \in \mathbb{R}^k$ and assume that the data $\{A, y\}$ are drawn i.i.d. according to the distribution in (2); then the likelihood function can be written as
$$p(y \mid W, A, \sigma) = \prod_{j=1}^{k} \prod_{i=1}^{m_j} p(y_j^i \mid w_j, x_j^i, \sigma_j).$$
To capture the task relatedness, we impose an exponential prior on the $i$-th row of $W$, i.e.,
$$p(w^i \mid \delta^i) \propto \exp(-\|w^i\|_2 \, \delta^i), \quad i = 1, 2, \ldots, n, \qquad (3)$$
where $\delta^i > 0$ is the so-called rate parameter.
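Before continuing the derivation, the following lines (our own numerical illustration, continuing the toy arrays above) evaluate the two mixed norms and show how, up to additive constants, the Laplacian model (2) turns the per-task negative log-likelihood into an $\ell_1$ residual, which is what motivates the least absolute deviations loss used below.

```python
# Mixed matrix norms used for joint feature selection (rows of W index features):
l21_norm = np.sum(np.linalg.norm(W, axis=1))        # ||W||_{2,1} = sum_i ||w^i||_2
linf1_norm = np.sum(np.abs(W).max(axis=1))          # ||W||_{inf,1} = sum_i max_j |W_{i,j}|

# Under the Laplacian model (2), the negative log-likelihood of task j (dropping
# constants) reduces to the l1 residual ||X_j w_j - y_j||_1 / sigma_j.
sigma = 1.0                                          # illustrative common scale
neg_log_lik = sum(np.sum(np.abs(Xj @ W[:, j] - yj)) / sigma
                  for j, (Xj, yj) in enumerate(zip(X, y)))
```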
Denote $\delta = [\delta^1, \ldots, \delta^n] \in \mathbb{R}^n$ and assume that the rows $w^1, \ldots, w^n$ are drawn i.i.d. according to (3); then we can express the prior on $W$ as $p(W \mid \delta) = \prod_{i=1}^{n} p(w^i \mid \delta^i)$. It follows that the posterior distribution of $W$ is $p(W \mid A, y, \sigma, \delta) \propto p(y \mid W, A, \sigma)\, p(W \mid \delta)$. With the above likelihood and prior, we can obtain a maximum a posteriori solution of $W$ by solving the following optimization problem:
$$\min_{W \in \mathbb{R}^{n \times k}} \ \sum_{j=1}^{k} \frac{1}{\sigma_j} \|X_j w_j - y_j\|_1 + \sum_{i=1}^{n} \delta^i \|w^i\|_2. \qquad (4)$$
Clearly, the first term in (4) can be viewed as a Least Absolute Deviations (LAD) criterion. Thus, we call (4) the LAD multitask feature selection model. For simplicity, we assume that $\sigma_j = \sigma$ for all $j$ and $\delta^i = \delta$ for all $i$. Letting $\mu = \sigma\delta$, the proposed MTFS model based on LAD is given as follows:
$$\min_{W \in \mathbb{R}^{n \times k}} \ \sum_{j=1}^{k} \|X_j w_j - y_j\|_1 + \mu \sum_{i=1}^{n} \|w^i\|_2. \qquad (5)$$

2.2. Algorithm. In the following, we show how to solve the LAD multitask learning model (5). Actually, (5) can be rewritten as a matrix factorization problem with a sparsity penalty:
$$\min_{W \in \mathbb{R}^{n \times k}} \ \|\mathcal{X}(W) - y\|_1 + \mu \|W\|_{2,1}, \qquad (6)$$
where $\mathcal{X}: \mathbb{R}^{n \times k} \to \mathbb{R}^m$ is the map defined by task-wise matrix-vector multiplication, i.e., $\mathcal{X}(W) = [X_1 w_1; \cdots; X_k w_k] \in \mathbb{R}^m$, and $y = [y_1; \cdots; y_k] \in \mathbb{R}^m$. By introducing an artificial variable $r \in \mathbb{R}^m$, (6) can be rewritten as
$$\min_{W \in \mathbb{R}^{n \times k},\, r \in \mathbb{R}^m} \ \Big\{ \|r\|_1 + \mu \|W\|_{2,1} \ : \ \mathcal{X}(W) - y = r \Big\}, \qquad (7)$$
which is a convex optimization problem and can be solved by many methods. In this paper, we adopt an alternating direction method (ADM) that minimizes the following augmented Lagrangian function:
$$L(W, r, \lambda) = \|r\|_1 + \mu \|W\|_{2,1} - \lambda^T \big[\mathcal{X}(W) - y - r\big] + \frac{\beta}{2} \big\|\mathcal{X}(W) - y - r\big\|_2^2, \qquad (8)$$
where $\lambda \in \mathbb{R}^m$ is the Lagrange multiplier and $\beta > 0$ is the penalty parameter of the linear constraint in (7). The basic idea of ADM dates back to the work of Gabay and Mercier [8], and the method has been applied in many different fields such as image restoration [20, 28], quadratic programming [16], online learning [24], and background-foreground extraction [27]. Given $(r_k, \lambda_k)$, ADM generates the next iterate via
$$\begin{cases} W_{k+1} \leftarrow \arg\min_{W \in \mathbb{R}^{n \times k}} L(W, r_k, \lambda_k), \\ r_{k+1} \leftarrow \arg\min_{r \in \mathbb{R}^m} L(W_{k+1}, r, \lambda_k), \\ \lambda_{k+1} \leftarrow \lambda_k - \beta \big[\mathcal{X}(W_{k+1}) - y - r_{k+1}\big]. \end{cases} \qquad (9)$$
We can see from (9) that at each iteration the main computational work of ADM is solving the two subproblems for $W$ and $r$.
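As an illustration of how (9) can be carried out, the sketch below (continuing the toy data above) implements one possible ADM-style loop for (7). The $r$-subproblem has a closed form: minimizing (8) over $r$ is componentwise soft-thresholding at level $1/\beta$. The $W$-subproblem does not, so here we use a linearized (proximal-gradient) step with row-wise group soft-thresholding; this simplification, the step size $\tau$, and the parameter values are our own choices for illustration and need not match the update rule analyzed in the paper.

```python
def soft_threshold(v, t):
    """Componentwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def row_group_shrink(M, t):
    """Row-wise group soft-thresholding, the proximal operator of t * ||.||_{2,1}."""
    row_norms = np.linalg.norm(M, axis=1, keepdims=True)
    return np.maximum(1.0 - t / np.maximum(row_norms, 1e-12), 0.0) * M

def X_op(W):
    """The map X(W): stack X_j w_j over the tasks into one vector in R^m."""
    return np.concatenate([Xj @ W[:, j] for j, Xj in enumerate(X)])

def X_adj(z):
    """Adjoint of X: split z by task and form the columns X_j^T z_j."""
    pieces = np.split(z, np.cumsum(m)[:-1])
    return np.column_stack([Xj.T @ zj for Xj, zj in zip(X, pieces)])

mu, beta, tau = 0.1, 1.0, 0.01     # illustrative parameter values, not from the paper
y_all = np.concatenate(y)
r = np.zeros_like(y_all)
lam = np.zeros_like(y_all)

for _ in range(200):
    # W-step (approximate): linearize the smooth part of (8) at the current W and
    # take one proximal step with the ||.||_{2,1} penalty.
    grad = X_adj(beta * (X_op(W) - y_all - r) - lam)
    W = row_group_shrink(W - tau * grad, tau * mu)
    # r-step (exact): minimizing (8) over r is soft-thresholding at level 1/beta.
    r = soft_threshold(X_op(W) - y_all - lam / beta, 1.0 / beta)
    # Multiplier update, as in (9).
    lam = lam - beta * (X_op(W) - y_all - r)
```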