ℓp Quasi-Norm Minimization

M.E. Ashour∗, C.M. Lagoa∗ and N.S. Aybat† ∗ The Department of Electrical Engineering and Computer Science, The Pennsylvania State University. † The Department of Industrial Engineering, The Pennsylvania State University.

Abstract—The ℓp (0 < p < 1) quasi-norm is used as a sparsity-inducing function and has applications in diverse areas, e.g., statistics, machine learning, and signal processing. This paper proposes a heuristic based on a two-block ADMM algorithm for tackling ℓp quasi-norm minimization problems. For p = s/q < 1, s, q ∈ Z+, the proposed algorithm requires solving for the roots of a scalar degree 2q polynomial, as opposed to applying a soft thresholding operator in the case of ℓ1. We show numerical results for two example applications, sparse signal reconstruction from few noisy measurements and spam email classification using support vector machines. Our method obtains significantly sparser solutions than those obtained by ℓ1 minimization while achieving a similar level of measurement fitting in signal reconstruction, and of training and test set accuracy in classification.

This work was partially supported by National Institutes of Health (NIH) Grant R01 HL142732 and National Science Foundation (NSF) Grant 1808266.

I. INTRODUCTION

This paper considers problems of the form:

\[
\min_{x} \ \|x\|_p^p \triangleq \sum_{i \in [n]} |x_i|^p \quad \text{s.t.} \quad f(x) \le 0, \tag{1}
\]

where p ∈ (0, 1), [n] = {1, . . . , n}, n ∈ Z+, and f : R^n → R is a convex, possibly nonsmooth function. This formulation arises in areas such as machine learning and signal processing. For instance, let {(u_i, v_i)}_{i∈[m]} be a training set of feature-label pairs (u_i, v_i), m ∈ Z+. In regression, one seeks to fit a model that relates v_i ∈ R to u_i via solving (1) with f(x) = ‖Ux − v‖_2 − ε, where the ith row of U ∈ R^{m×n} is constructed using u_i, v = [v_i]_{i∈[m]}, and ε > 0. Alternatively, if v_i ∈ {−1, 1} denotes a class label in a binary classification problem, one might seek a linear classifier with decision rule v̂ = sign(u^⊤x), e.g., using support vector machines, where the first entry of u and of x are 1 and the bias term, respectively. The classifier can be obtained by solving (1) with f(x) = (1/m) Σ_{i∈[m]} (1 − v_i u_i^⊤ x)^+ − ε, where (·)^+ = max(·, 0). In both examples, the ℓp quasi-norm of x is used in (1) as a sparsity-inducing function. Problem (1) provides a tradeoff between how well a model performs on a certain task and its complexity, controlled by ε.

We propose an ADMM algorithm for approximating the solution of (1). For p = s/q < 1, the computational complexity of the proposed algorithm is similar to that of ℓ1 minimization, except for the additional effort of solving for the roots of a scalar degree 2q polynomial as opposed to applying the soft thresholding operator for ℓ1. We present numerical results showing that our method significantly outperforms ℓ1 minimization in terms of the sparsity level of the obtained solutions.

A sparse solution to (1) is defined as one that has a small number of entries whose magnitudes are significantly different than zero [1]. Indeed, many signals/images are either sparse or compressible, i.e., can be approximated by a sparse representation with respect to some transform domain. The development of a plethora of overcomplete waveform dictionaries motivates the basis pursuit principle that decomposes a signal into a sparse superposition of dictionary elements [2]. Furthermore, sparsity finds application in object recognition and classification problems, e.g., [5], and in signal estimation from incomplete linear measurements, known as compressed sensing [6], [7]. Reference [8] provides a comprehensive review of theoretical results on sparse solutions of linear systems and their applications in inverse problems.

Retrieving sparse solutions of underdetermined linear systems has received tremendous attention over the past two decades; see [8] and references therein. Reference [10] identifies the major algorithmic approaches for tackling sparse approximation problems, namely, greedy pursuit [11], convex relaxation [2], [6], [7], [12], and nonconvex optimization [13], [14]. Problems seeking sparse solutions are often posed as min{f(x) + µ g(x)} for some µ > 0 and a sparsity-inducing penalty function g, e.g., g(x) = ‖x‖_p^p, where g can be either convex, e.g., p = 1, or nonconvex, e.g., 0 ≤ p < 1. For a comprehensive reference on sparsity-inducing penalty functions, see [15]. It has been shown that exact sparse signal recovery from few measurements is possible by letting g be the ℓ1 norm if the measurement matrix satisfies a certain restricted isometry property (RIP) [1], [16]. However, RIP is a stringent property. Motivated by the fact that ‖x‖_p^p → ‖x‖_0 as p → 0, it is natural to consider the ℓp quasi-norm problem (0 < p < 1). It has been shown in [17] that ℓp minimization with p < 1 achieves perfect signal reconstruction under less restrictive isometry conditions than those needed for ℓ1. Several references have considered sparse signal reconstruction via nonconvex optimization, [13], [17]–[21] to name a few. In [13], it is shown that by replacing the ℓ1 norm with the ℓp quasi-norm, signal recovery is possible using fewer measurements. Furthermore, [13] presents a simple projected gradient descent method that identifies a local minimizer of the problem. An algorithm that uses operator splitting and Bregman iteration methods as well as a shrinkage operator is presented in [18]. Reference [19] proposes an algorithm based on the idea of locally replacing the original nonconvex objective function by quadratic convex functions that are easily minimized, and establishes a connection to iterative reweighted least squares [22]. In [20], an interior-point potential reduction algorithm is proposed to compute an ε-KKT solution in O((n/ε) log(1/ε)) iterations, where n is the dimension of x. Reference [21] uses ADMM and proposes a generalized shrinkage operator for nonconvex sparsity-inducing penalty functions.
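To make the two instances of f(x) above concrete, the following minimal sketch (our own illustration; variable names and the value of ε are not from the paper) evaluates the regression and hinge-loss constraint functions; a point x is feasible for (1) whenever the returned value is nonpositive.

```python
import numpy as np

def f_regression(x, U, v, eps):
    # f(x) = ||U x - v||_2 - eps: nonpositive iff the residual norm is at most eps.
    return np.linalg.norm(U @ x - v) - eps

def f_hinge(x, U, v, eps):
    # f(x) = (1/m) * sum_i (1 - v_i u_i^T x)^+ - eps: nonpositive iff the
    # average hinge loss of the linear classifier x is at most eps.
    return np.mean(np.maximum(1.0 - v * (U @ x), 0.0)) - eps
```

Here U stacks the feature vectors u_i as rows and v collects the corresponding targets or labels.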


II. ALGORITHM

This section develops a method for approximating the solution of (1). Problem (1) is convex at p = 1; hence, it can be solved efficiently and exactly. However, the problem becomes nonconvex when p < 1. An epigraph equivalent formulation of (1) is obtained by introducing the variable t = [t_i]_{i∈[n]}:

\[
\begin{aligned}
\min_{x,t} \quad & \mathbf{1}^\top t \\
\text{s.t.} \quad & t_i \ge |x_i|^p, \ i \in [n] \\
& f(x) \le 0,
\end{aligned} \tag{2}
\]

where 1 is a vector of all ones. Let the nonconvex set X ⊂ R^2 be the epigraph of the scalar function |x|^p, i.e., X = {(x, t) ∈ R^2 : t ≥ |x|^p}. Then, (2) can be cast as

\[
\begin{aligned}
\min_{x,t} \quad & \sum_{i \in [n]} \mathbf{1}_{\mathcal{X}}(x_i, t_i) + \mathbf{1}^\top t \\
\text{s.t.} \quad & f(x) \le 0,
\end{aligned} \tag{3}
\]

where 1_X(·) is the indicator function of the set X, i.e., it evaluates to zero if its argument belongs to the set X and to +∞ otherwise. ADMM exploits the structure of the problem to split the optimization over the variables by iteratively solving fairly simple subproblems. In particular, we introduce auxiliary variables y = [y_i]_{i∈[n]} and z = [z_i]_{i∈[n]} and obtain an ADMM equivalent formulation of (3) given by:

\[
\begin{aligned}
\min_{x,t,y,z} \quad & \sum_{i \in [n]} \mathbf{1}_{\mathcal{X}}(x_i, t_i) + \mathbf{1}_{\mathcal{Y}}(y) + \mathbf{1}^\top z \\
\text{s.t.} \quad & x = y : \lambda \\
& t = z : \theta,
\end{aligned} \tag{4}
\]

where Y is the 0-sublevel set of f, i.e., Y = {y ∈ R^n : f(y) ≤ 0}. The dual variables associated with the constraints x = y and t = z are λ and θ, respectively. Hence, the Lagrangian function corresponding to (4), augmented with a quadratic penalty on the violation of the equality constraints with penalty parameter ρ > 0, is given by:

\[
L_\rho(x,t,y,z,\lambda,\theta) = \sum_{i \in [n]} \mathbf{1}_{\mathcal{X}}(x_i,t_i) + \mathbf{1}_{\mathcal{Y}}(y) + \mathbf{1}^\top z + \lambda^\top (x-y) + \theta^\top (t-z) + \frac{\rho}{2}\left(\|x-y\|^2 + \|t-z\|^2\right). \tag{5}
\]

Consider the two block variables (x, t) and (y, z). Then, ADMM [23] consists of the following iterations:

\[
\begin{aligned}
(x,t)^{k+1} &= \operatorname*{argmin}_{x,t} \ L_\rho(x, t, y^k, z^k, \lambda^k, \theta^k) && (6) \\
(y,z)^{k+1} &= \operatorname*{argmin}_{y,z} \ L_\rho(x^{k+1}, t^{k+1}, y, z, \lambda^k, \theta^k) && (7) \\
\lambda^{k+1} &= \lambda^k + \rho (x^{k+1} - y^{k+1}) && (8) \\
\theta^{k+1} &= \theta^k + \rho (t^{k+1} - z^{k+1}). && (9)
\end{aligned}
\]

According to the expression of the augmented Lagrangian function in (5), it follows from (6) that the variables x and t are updated via solving the following nonconvex problem:

\[
\begin{aligned}
\min_{x,t} \quad & \left\|x - y^k + \frac{\lambda^k}{\rho}\right\|^2 + \left\|t - z^k + \frac{\theta^k}{\rho}\right\|^2 \\
\text{s.t.} \quad & (x_i, t_i) \in \mathcal{X}, \ i \in [n].
\end{aligned} \tag{10}
\]

Exploiting the separable structure of (10), one immediately concludes that (10) splits into n independent 2-dimensional problems that can be solved in parallel, i.e., for each i ∈ [n],

\[
(x_i, t_i)^{k+1} = \Pi_{\mathcal{X}}\!\left(y_i^k - \frac{\lambda_i^k}{\rho},\ z_i^k - \frac{\theta_i^k}{\rho}\right), \tag{11}
\]

where Π_X(·) denotes the Euclidean projection operator onto the set X. Furthermore, (5) and (7) imply that y and z are independently updated as follows:

\[
y^{k+1} = \Pi_{\mathcal{Y}}\!\left(x^{k+1} + \frac{\lambda^k}{\rho}\right) \tag{12}
\]
\[
z^{k+1} = t^{k+1} + \frac{\theta^k - \mathbf{1}}{\rho}. \tag{13}
\]

Algorithm 1 summarizes the proposed ADMM algorithm.

Algorithm 1: ADMM (ρ > 0)
  1: Initialize y^0, z^0, λ^0, θ^0
  2: for k ≥ 0 do
  3:   (x_i, t_i)^{k+1} ← Π_X(y_i^k − λ_i^k/ρ, z_i^k − θ_i^k/ρ), ∀i ∈ [n]
  4:   y^{k+1} ← Π_Y(x^{k+1} + λ^k/ρ)
  5:   z^{k+1} ← t^{k+1} + (θ^k − 1)/ρ
  6:   λ^{k+1} ← λ^k + ρ(x^{k+1} − y^{k+1})
  7:   θ^{k+1} ← θ^k + ρ(t^{k+1} − z^{k+1})

It is clear that z, λ, and θ admit closed form updates. However, updating (x, t) requires solving n nonconvex problems. Moreover, a projection onto the convex set Y is needed for updating y, which can lead to a heavy computational burden. In the following two sections, we present our approach for handling these two concerns.
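The iterations above reduce to the two projections and the closed-form updates of Algorithm 1. A minimal sketch in Python is given below, assuming the caller supplies the two projection routines: project_X performs the elementwise nonconvex projection onto X (Section III, Algorithm 2 applied coordinate-wise) and project_Y is the Euclidean projection onto the 0-sublevel set of f (Section IV). The function names, zero initialization, and fixed iteration budget are our own choices, not prescribed by the paper.

```python
import numpy as np

def lp_admm(project_X, project_Y, n, rho=1.0, num_iters=100):
    """Sketch of Algorithm 1 (two-block ADMM) for problem (1).

    project_X(xbar, tbar) -> (x, t): elementwise projection onto
        X = {(x, t) : t >= |x|^p}, applied coordinate-wise to the vectors.
    project_Y(ybar) -> y: Euclidean projection onto Y = {y : f(y) <= 0}.
    """
    y = np.zeros(n); z = np.zeros(n)           # auxiliary variables
    lam = np.zeros(n); theta = np.zeros(n)     # dual variables
    for _ in range(num_iters):
        # Step 3: nonconvex projection, n independent 2-D problems (eq. (11)).
        x, t = project_X(y - lam / rho, z - theta / rho)
        # Step 4: convex projection onto the 0-sublevel set of f (eq. (12)).
        y = project_Y(x + lam / rho)
        # Step 5: closed-form z-update (eq. (13)); the "1" is the all-ones vector.
        z = t + (theta - 1.0) / rho
        # Steps 6-7: dual updates (eqs. (8)-(9)).
        lam = lam + rho * (x - y)
        theta = theta + rho * (t - z)
    return y  # y satisfies f(y) <= 0 by construction; at convergence x ≈ y
```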


III. NONCONVEX PROJECTION

In this section, we present the method used to tackle the nonconvex projection problem required to update x and t. Among the advantages of the proposed algorithm is that it is amenable to decentralization. As is clear from (11), x and t can be updated element-wise via performing a projection operation onto the nonconvex set X, one for each i ∈ [n]. The n projection problems can be run independently in parallel. We now outline the proposed idea for solving one such projection, i.e., we suppress the dependence on the index of the entry of x and t. For (x̄, t̄) ∈ R^2, Π_X(x̄, t̄) entails solving

\[
\begin{aligned}
\min_{x,t} \quad & g(x,t) \triangleq (t - \bar{t})^2 + (x - \bar{x})^2 \\
\text{s.t.} \quad & t \ge |x|^p.
\end{aligned} \tag{14}
\]

If t̄ ≥ |x̄|^p, then trivially Π_X(x̄, t̄) = (x̄, t̄). Thus, we focus on the case in which t̄ < |x̄|^p. The following proposition states necessary optimality conditions for (14).

Proposition 1. Let t̄ < |x̄|^p, and let (x*, t*) be an optimal solution of (14). Then, the following properties are satisfied:
(a) sign(x*) = sign(x̄),
(b) t* ≥ t̄,
(c) |x*|^p ≥ t̄,
(d) t* = |x*|^p.

Proof. We prove the statements by contradiction as follows:

(a) Suppose that sign(x*) ≠ sign(x̄). Then

\[
|x^* - \bar{x}| = |x^* - 0| + |\bar{x} - 0| > |\bar{x} - 0|, \tag{15}
\]

i.e., (x* − x̄)^2 > (0 − x̄)^2. Hence, g(x*, t*) − g(0, t*) > 0. Moreover, the feasibility of (x*, t*) implies that t* > 0. Thus, (0, t*) is feasible and attains a lower objective value than that attained by (x*, t*). This contradicts the optimality of (x*, t*).

(b) Assume that t* < t̄. Then,

\[
g(x^*, t^*) - g(x^*, \bar{t}) = (t^* - \bar{t})^2 > 0. \tag{16}
\]

Furthermore, by the feasibility of (x*, t*), we have |x*|^p ≤ t* < t̄. Thus, (x*, t̄) is feasible and attains a lower objective value than that attained by (x*, t*). This contradicts the optimality of (x*, t*).

(c) Suppose that |x*|^p < t̄, i.e.,

\[
-\bar{t}^{1/p} < x^* < \bar{t}^{1/p}. \tag{17}
\]

We now consider two cases, x̄ > 0 and x̄ < 0. First, let x̄ > 0. Then, we have by (a) and (17) that 0 < x* < t̄^{1/p}. Since t̄ < |x̄|^p, i.e., (x̄, t̄) ∉ X, we have t̄^{1/p} < x̄ and hence 0 < x* < t̄^{1/p} < x̄. Pick x_0 > 0 such that |x_0|^p = t̄, i.e., x_0 = t̄^{1/p}. Then clearly, x* < x_0 < x̄. Thus, we have

\[
g(x^*, t^*) - g(x_0, t^*) = (x^* - \bar{x})^2 - (x_0 - \bar{x})^2 > 0, \tag{18}
\]

where the last inequality follows from the just proven relation x* < x_0 < x̄. Moreover, we have by (b) that |x_0|^p = t̄ ≤ t*. Thus, (x_0, t*) is feasible and attains a lower objective value than that attained by (x*, t*). This contradicts the optimality of (x*, t*).

On the other hand, let x̄ < 0. Then, we have by (a) and (17) that −t̄^{1/p} < x* < 0. Since t̄ < |x̄|^p, i.e., (x̄, t̄) ∉ X, then t̄^{1/p} < |x̄|, i.e., x̄ < −t̄^{1/p}. Therefore,

\[
\bar{x} < -\bar{t}^{1/p} < x^*. \tag{19}
\]

Pick x_0 < 0 such that |x_0|^p = t̄, i.e., x_0 = −t̄^{1/p}. Then, (18) also holds when x̄ < 0. Note that |x_0|^p = t̄ ≤ t* by (b). Thus, (x_0, t*) is feasible and attains a lower objective value than that attained by (x*, t*). This contradicts the optimality of (x*, t*).

(d) The feasibility of (x*, t*) eliminates the possibility that t* < |x*|^p. Now let t* > |x*|^p and pick t_0 = |x*|^p. Then, t̄ ≤ |x*|^p = t_0 < t*, where the first inequality follows from (c). Then, 0 ≤ t_0 − t̄ < t* − t̄. Thus, we have

\[
g(x^*, t^*) - g(x^*, t_0) = (t^* - \bar{t})^2 - (t_0 - \bar{t})^2 > 0. \tag{20}
\]

Furthermore, the feasibility of (x*, t_0) follows trivially from the choice of t_0. Thus, (x*, t_0) is feasible and attains a lower objective value than that attained by (x*, t*). This contradicts the optimality of (x*, t*).

This concludes the proof.

We now make use of the fact that, for (14), an optimal solution (x*, t*) satisfies t* = |x*|^p; hence, (14) reduces to solving

\[
\min_{x} \ (|x|^p - \bar{t})^2 + (x - \bar{x})^2. \tag{21}
\]

The first order necessary optimality condition for (21) implies the following:

\[
p |x^*|^{p-1} \operatorname{sign}(x^*)\,(|x^*|^p - \bar{t}) + x^* - \bar{x} = 0. \tag{22}
\]

By the symmetry of the function |x|^p, assume without loss of generality that x* > 0, and let 0 < p = s/q < 1 for some s, q ∈ Z+. A change of variables a = x^{1/q} plugged into (22) shows that finding an optimal solution for (14) reduces to finding a root of the following scalar degree 2q polynomial:

\[
a^{2q} + \frac{s}{q}\left(a^{2s} - \bar{t}\, a^{s}\right) - \bar{x}\, a^{q}. \tag{23}
\]

Thus, to find Π_X(x̄, t̄), solve for a root a* of the polynomial in (23) such that (a*^q, a*^s) minimizes g(x, t). Algorithm 2 summarizes the method we use to solve problem (14). In case x̄ = 0, we set x* = t* = 0. If the set R̄ is empty, we set x* = 0 and t* = (t̄)^+.

Algorithm 2: Nonconvex projection (p = s/q < 1)
  1: R ← roots{a^{2q} + (s/q)(a^{2s} − t̄ a^s) − |x̄| a^q}
  2: R̄ ← R \ {complex numbers and negative reals in R}
  3: T ← {(r^q, r^s) : r ∈ R̄}
  4: (x̂, t*) ← argmin{g(x, t) : (x, t) ∈ T}
  5: x* ← sign(x̄) x̂
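A minimal sketch of Algorithm 2 for a single coordinate is given below (our own code, not from the paper), using numpy.roots as the polynomial root finder; it assumes p = s/q with positive integers s < q and follows the fallback rules stated above for x̄ = 0 and an empty R̄.

```python
import numpy as np

def project_X_scalar(xbar, tbar, s, q):
    """Euclidean projection of (xbar, tbar) onto X = {(x, t) : t >= |x|**(s/q)}."""
    p = s / q
    if tbar >= abs(xbar) ** p:                    # (xbar, tbar) already in X
        return xbar, tbar
    if xbar == 0.0:                               # degenerate case from the text
        return 0.0, 0.0
    # Coefficients of a^{2q} + (s/q)(a^{2s} - tbar*a^s) - |xbar|*a^q,
    # ordered from the highest power (2q) down to the constant term.
    coeffs = np.zeros(2 * q + 1)
    coeffs[0] = 1.0                               # a^{2q}
    coeffs[2 * q - 2 * s] += s / q                # a^{2s}
    coeffs[2 * q - s] += -(s / q) * tbar          # a^{s}
    coeffs[q] += -abs(xbar)                       # a^{q}
    roots = np.roots(coeffs)
    # Keep strictly positive real roots; a = 0 is a spurious root introduced
    # by the change of variables and the multiplication by a^q.
    pos_real = [r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0]
    if not pos_real:
        return 0.0, max(tbar, 0.0)                # R-bar empty: x* = 0, t* = (tbar)^+
    # Candidates (a^q, a^s) lie on the boundary t = |x|^p; pick the closest one.
    cands = [(r ** q, r ** s) for r in pos_real]
    x_hat, t_star = min(cands,
                        key=lambda c: (c[1] - tbar) ** 2 + (c[0] - abs(xbar)) ** 2)
    return np.sign(xbar) * x_hat, t_star
```

For example, with p = 1/2 (s = 1, q = 2) and (x̄, t̄) = (1, 0), the polynomial is a^4 − a^2/2, whose positive root a = 1/√2 gives the projection (x*, t*) = (1/2, 1/√2), consistent with the stationarity condition (22).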


IV. DISTRIBUTED CONVEX PROJECTION

Step 4 of Algorithm 1 involves solving a convex projection problem. Although convex, solving this problem can potentially be a computational bottleneck in some applications, or not feasible at all in others. For example, designing a classifier with a huge training set renders the projection problem computationally intensive. Moreover, if the data is distributed among multiple entities, i.e., each entity has a private local data set, then a centralized solution to the projection problem might not be a viable option.

In many applications, the function f(x) in (1) evaluates the loss incurred by x averaged over a data set T, i.e., f admits a separable structure with respect to the data. Thus, we propose to solve the projection problem using a scatter-gather type of algorithm in which a master node distributes the computational load over a group of workers N. Each worker performs local few-shot learning, while the master combines their decisions.

Let T = {(u_j, v_j)}_{j∈[m]} be a training set of feature-label pairs with m examples, and consider f(x) = (1/m) Σ_{j∈[m]} ℓ(u_j, v_j, x) − ε, where ℓ is some convex loss function. Then, Π_Y(ȳ) entails solving:

\[
\min_{y} \ \frac{1}{2}\|y - \bar{y}\|^2 \quad \text{s.t.} \quad \frac{1}{m}\sum_{j \in [m]} \ell(u_j, v_j, y) \le \epsilon. \tag{24}
\]

Define {T_i}_{i∈N} as a partition of T, where the ith set of training examples T_i is assigned to worker i ∈ N. Indeed, T_i might be a local private data set generated at worker i ∈ N. Instead of solving (24) in a centralized setting, the master broadcasts ȳ to all workers, and each worker finds

\[
y_i^* = \operatorname*{argmin}_{y_i} \ \frac{1}{2}\|y_i - \bar{y}\|^2 \quad \text{s.t.} \quad \frac{1}{|T_i|}\sum_{j \in T_i} \ell(u_j, v_j, y_i) \le \epsilon, \tag{25}
\]

and reports its local decision y_i^* to the master; the master then combines the workers' decisions simply via averaging, i.e., y^* = Σ_{i∈N} (|T_i|/m) y_i^*. This y^* is the updated y variable with which Algorithm 1 proceeds.
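A minimal sketch of this scatter-gather step is shown below (our own code); it assumes a routine local_project that solves the worker subproblem (25) with whatever convex solver is available at the worker, and it runs the workers sequentially for illustration.

```python
import numpy as np

def distributed_projection(ybar, partitions, local_project, eps):
    """Scatter-gather approximation of the convex projection (24).

    partitions: list of local data sets T_i, one per worker.
    local_project(ybar, T_i, eps): solves (25), i.e.,
        argmin_y 0.5*||y - ybar||^2  s.t.  mean loss of y over T_i <= eps.
    """
    m = sum(len(T_i) for T_i in partitions)
    # Scatter: in a real deployment, each call runs at a different worker.
    local_solutions = [local_project(ybar, T_i, eps) for T_i in partitions]
    # Gather: data-weighted average y* = sum_i (|T_i|/m) * y_i*.
    return sum((len(T_i) / m) * np.asarray(y_i)
               for T_i, y_i in zip(partitions, local_solutions))
```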

V. NUMERICAL RESULTS

We conduct numerical experiments on two families of problems: i) sparse signal reconstruction from noisy measurements, and ii) binary classification using support vector machines. We compare our method on problem (1) at p = 0.5 with a CVX [26] solution of (1) at p = 1.

A. Sparse signal reconstruction

Let n = 2^{10}, m = n/4, and randomly construct a matrix U ∈ R^{m×n} such that U = [M, −M], where M is a sparse binary random matrix with a small number of ones in each column. More precisely, the number of ones in each column of M is generated independently and randomly in the range of integers between 10 and 20, and their locations are chosen independently at random for each column. Sparse binary random matrices satisfy a weak form of the RIP, and following the setup in [20], the concatenation of two copies of the same random matrix in U implies that column orthogonality is not maintained. A reference signal x_opt is generated such that ‖x_opt‖_0 = ⌈0.2n⌉. The indices of the nonzero entries of x_opt are chosen uniformly at random, and the value of each nonzero entry is generated according to a zero-mean unit-variance Gaussian distribution. The measurement vector is v = U x_opt + n, where n is a zero-mean Gaussian noise vector with covariance σ²I, and I is the identity matrix. We reconstruct a sparse signal from the noisy measurements v via solving (1) with f(x) = ‖Ux − v‖/‖v‖ − ε, i.e., f(x) measures the relative residual corresponding to x.

Fig. 1 plots the sparsity level of solutions obtained by both ℓp quasi-norm and ℓ1 norm minimization at various noise variance levels. In particular, we solve (1) at p = 1/2 and p = 1. An entry of x whose absolute value exceeds 10^{−6} is considered a nonzero element, and we compute the zero norm of x accordingly. At each value of σ², after the generation of the corresponding measurement vector v, we choose ε = ‖U x_opt − v‖/‖v‖. Obviously, this choice of ε is not possible in practice as x_opt is not known a priori. Nevertheless, its choice is motivated by the inherent dependence of ε and σ²: one chooses ε depending on the noise level, i.e., the higher the noise variance, the higher the value of ε. Fig. 1 shows that our method produces sparser solutions than those obtained by ℓ1 when the noise level is relatively high. It is worth mentioning that the solutions produced by Algorithm 1 are feasible with respect to (1). Furthermore, the behavior depicted by Fig. 1 persists for any problem instance generated as described above.

Fig. 1. Effect of noise variance on the sparsity of solutions obtained by ℓp quasi-norm and ℓ1 norm minimization (‖x‖_0/n versus σ², for p = 0.5 and p = 1).
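The data generation described above can be sketched as follows (our own script; the random seed and generator are ours, and ε is computed exactly as in the experiment).

```python
import numpy as np

rng = np.random.default_rng(0)                  # hypothetical seed
n, m, sigma2 = 2**10, 2**10 // 4, 1e-2          # sigma2 is one of the tested variances

# Sparse binary M (m x n/2) with 10-20 ones per column; U = [M, -M].
M = np.zeros((m, n // 2))
for col in range(n // 2):
    k = rng.integers(10, 21)                    # number of ones in this column
    M[rng.choice(m, size=k, replace=False), col] = 1.0
U = np.hstack([M, -M])

# Reference signal with ceil(0.2*n) standard-normal nonzero entries.
x_opt = np.zeros(n)
support = rng.choice(n, size=int(np.ceil(0.2 * n)), replace=False)
x_opt[support] = rng.standard_normal(support.size)

# Noisy measurements and the epsilon used for the constraint f(x) <= 0.
noise = np.sqrt(sigma2) * rng.standard_normal(m)
v = U @ x_opt + noise
eps = np.linalg.norm(U @ x_opt - v) / np.linalg.norm(v)
```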
B. Binary classification

We use support vector machines to build a spam email classifier. The training set used is a subset of the SpamAssassin Public Corpus [24]. Let {(u_j, v_j)}_{j∈[m]} be the training set of feature vectors u_j ∈ {0, 1}^n with corresponding labels v_j ∈ {−1, 1} identifying whether the email is spam or not. We do not dwell on the details related to feature extraction since it is not the message that this experiment conveys. Instead, we highlight the effectiveness of our method in deciding whether an email is spam or not based on as few words as possible. Indeed, the body of an email is preprocessed based on ideas promoted by [25]. Following [25], we maintain a vocabulary list of n = 1899 words. Then, for a given preprocessed email j ∈ [m], the wth entry of u_j is 1 if word w in the vocabulary list appears in email j, and it is zero otherwise. We build a linear classifier with a decision rule v̂ = sign(u^⊤x), where u is the feature vector of the email in question, and x holds the linear classifier's coefficients with its first entry being the bias term.

The purpose is to build a classifier that attains a high accuracy on the training set and simultaneously uses the least possible number of words from the vocabulary list to detect whether an email is spam or legitimate, i.e., a low-complexity classifier. We build the classifier by solving (1) with:

\[
f(x) = \frac{1}{m}\sum_{j \in [m]} \left(1 - v_j u_j^\top x\right)^+ - \epsilon. \tag{26}
\]
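For illustration only, a binary bag-of-words feature vector and the decision rule can be formed as in the sketch below (our own helper, with our own convention of reserving index 0 for the bias entry; it is not the paper's feature-extraction code).

```python
import numpy as np

def email_features(tokens, vocab):
    """Binary bag-of-words features for a preprocessed email.

    vocab maps each vocabulary word to an index in 1..n (our convention);
    u[0] = 1 is the constant entry multiplying the bias term of x.
    """
    u = np.zeros(len(vocab) + 1)
    u[0] = 1.0
    for tok in tokens:
        if tok in vocab:
            u[vocab[tok]] = 1.0
    return u

def predict(u, x):
    # Decision rule v_hat = sign(u^T x); ties are broken toward +1 here.
    return 1 if u @ x >= 0 else -1
```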


The desired training set accuracy is controlled by choosing ε, i.e., the lower the value of ε, the higher the accuracy. Clearly, f not only urges u_j^⊤x to have the same sign as v_j, but it also urges that to happen with a margin, i.e., it tries to push u_j^⊤x below −1 if v_j = −1 to incur no cost from the jth training example. Likewise, it tries to push u_j^⊤x above 1 if v_j = 1.

We run Algorithm 1 with p = 1/2 on 2000 training emails at various values of ε. We terminate the algorithm after 100 iterations at each value of ε, and evaluate the performance of the learned classifier on a test set of 1000 emails. We solve (1) at p = 1 for comparing its performance with that obtained by Algorithm 1 at p = 1/2. Fig. 2 shows the number of nonzero entries in the classifier's coefficients x learned at p = 1 and p = 1/2. An entry of x is considered nonzero if its absolute value is above 10^{−4}. The corresponding training and test set accuracies for the obtained classifiers are plotted in Fig. 3. Figs. 2 and 3 together show that our method obtains classifiers that use a significantly smaller number of words to make a decision on the legitimacy of an email while achieving an almost similar level of accuracy on both the training and test sets. This behavior is consistent for all values of ε. For instance, at ε = 0.2, ℓ0.5 minimization uses 11 words from the vocabulary list to make decisions as opposed to 40 words used by ℓ1 minimization. Nevertheless, both ℓ0.5 and ℓ1 attain about the same training and test set accuracies, 93% and 90%, respectively.

Fig. 2. Number of words selected for classification versus ε for ℓp quasi-norm and ℓ1 norm minimization (p = 1 and p = 0.5).

Fig. 3. Training and test set accuracies versus ε for ℓp quasi-norm and ℓ1 norm minimization (percentage of correct decisions for p = 1 and p = 0.5).

VI. CONCLUSION

We present a nonconvex ADMM algorithm that approximates the solution of the ℓp (0 < p < 1) quasi-norm minimization problem. The algorithm is computationally efficient: its complexity is bounded by the effort of finding a root of a scalar degree 2q polynomial for a rational 0 < p = s/q < 1. The method is numerically shown to significantly outperform ℓ1 in terms of the sparsity of the generated solutions at similar performance levels in different tasks.

REFERENCES

[1] Candès E.J. and Tao T. Decoding by linear programming. IEEE Transactions on Information Theory, 2005;51(12):4203-15.
[2] Chen S.S., Donoho D.L. and Saunders M.A. Atomic decomposition by basis pursuit. SIAM Review, 2001;43(1):129-59.
[3] Taubman D.S. and Marcellin M.W. JPEG 2000: Image Compression Fundamentals, Standards and Practice. Kluwer Academic, Norwell, MA; 2001.
[4] Mallat S. A Wavelet Tour of Signal Processing. Academic Press; 1999.
[5] Wright J., Yang A.Y., Ganesh A., Sastry S.S. and Ma Y. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009;31(2):210-27.
[6] Candès E.J., Romberg J. and Tao T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 2006;52(2):489-509.
[7] Donoho D.L. Compressed sensing. IEEE Transactions on Information Theory, 2006;52(4):1289-306.
[8] Bruckstein A.M., Donoho D.L. and Elad M. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 2009;51(1):34-81.
[9] Natarajan B.K. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 1995;24(2):227-34.
[10] Tropp J.A. and Wright S.J. Computational methods for sparse solution of linear inverse problems. Proceedings of the IEEE, 2010;98(6):948-58.
[11] Tropp J.A. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 2004;50(10):2231-42.
[12] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 1996;1:267-88.
[13] Chartrand R. Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Processing Letters, 2007;14(10):707-10.
[14] Chartrand R. and Yin W. Iteratively reweighted algorithms for compressive sensing. IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 3869-3872.
[15] Bach F., Jenatton R., Mairal J. and Obozinski G. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 2012;4(1):1-106.
[16] Donoho D.L. For most large underdetermined systems of linear equations the minimal ℓ1 norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 2006;59(6):797-829.
[17] Saab R., Chartrand R. and Yilmaz O. Stable sparse approximations via nonconvex optimization. IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 3885-3888.
[18] Chartrand R. Fast algorithms for nonconvex compressive sensing: MRI reconstruction from very few data. IEEE International Symposium on Biomedical Imaging, 2009, pp. 262-265.
[19] Mourad N. and Reilly J.P. Minimizing nonconvex functions for sparse vector reconstruction. IEEE Transactions on Signal Processing, 2010;58(7):3485-3496.
[20] Ge D., Jiang X. and Ye Y. A note on the complexity of Lp minimization. Mathematical Programming, 2011;129(2):285-299.
[21] Chartrand R. and Wohlberg B. A nonconvex ADMM algorithm for group sparsity with sparse groups. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6009-6013.
[22] Gorodnitsky I.F. and Rao B.D. Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm. IEEE Transactions on Signal Processing, 1997;45(3):600-616.
[23] Boyd S., Parikh N., Chu E., Peleato B. and Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2011;3(1):1-122.
[24] SpamAssassin Public Corpus, http://spamassassin.apache.org
[25] Andrew Ng, Machine Learning MOOC, available online: https://www.coursera.org/learn/machine-learning
[26] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.0 beta. http://cvxr.com/cvx, 2013.

