arXiv:1602.07017v1 [cs.CV] 23 Feb 2016

A survey of sparse representation: algorithms and applications

Zheng Zhang, Student Member, IEEE, Yong Xu, Senior Member, IEEE, Jian Yang, Member, IEEE, Xuelong Li, Fellow, IEEE, and David Zhang, Fellow, IEEE

Abstract—Sparse representation has attracted much attention from researchers in the fields of signal processing, image processing, computer vision, and pattern recognition. Sparse representation also has a good reputation in both theoretical research and practical applications. Many different algorithms have been proposed for sparse representation. The main purpose of this article is to provide a comprehensive study and an updated review on sparse representation and to supply guidance for researchers. The taxonomy of sparse representation methods can be studied from various viewpoints. For example, in terms of the different norm minimizations used in sparsity constraints, the methods can be roughly categorized into five groups: sparse representation with l0-norm minimization, sparse representation with lp-norm (0<p<1) minimization, sparse representation with l1-norm minimization, sparse representation with l2,1-norm minimization, and sparse representation with l2-norm minimization. In this paper, a comprehensive overview of sparse representation is provided. The available sparse representation algorithms can also be empirically categorized into four groups: greedy strategy approximation, constrained optimization, proximity algorithm-based optimization, and homotopy algorithm-based sparse representation. The rationales of the different algorithms in each category are analyzed, and a wide range of applications of sparse representation are summarized, which could sufficiently reveal the potential nature of the sparse representation theory. Specifically, an experimentally comparative study of these sparse representation algorithms is presented. The Matlab code used in this paper is available at: http://www.yongxu.org/lunwen.html.

Index Terms—Sparse representation, compressive sensing, greedy algorithm, constrained optimization, proximal algorithm, homotopy algorithm, dictionary learning.

Zheng Zhang and Yong Xu are with the Bio-Computing Research Center, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen 518055, Guangdong, P.R. China, and with the Key Laboratory of Network Oriented Intelligent Computation, Shenzhen 518055, Guangdong, P.R. China (e-mail: [email protected]).
Jian Yang is with the College of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, P.R. China.
Xuelong Li is with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), State Key Laboratory of Transient Optics and Photonics, Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an 710119, Shaanxi, P.R. China.
David Zhang is with the Biometrics Research Center, The Hong Kong Polytechnic University, Hong Kong.
Corresponding author: Yong Xu (email: [email protected]).

I. INTRODUCTION

WITH advancements in mathematics, linear representation methods (LRBM) have been well studied and have recently received considerable attention [1, 2]. The sparse representation method is the most representative methodology of the LRBM and has been proven to be an extraordinarily powerful solution to a wide range of application fields, especially in signal processing, image processing, machine learning, and computer vision, such as image denoising, debluring, inpainting, image restoration, super-resolution, visual tracking, image classification and image segmentation [3-10]. Sparse representation has shown huge potential capabilities in handling these problems.

Sparse representation, from the viewpoint of its origin, is directly related to compressed sensing (CS) [11-13], which is one of the most popular topics of recent years. Donoho [11] first proposed the original concept of compressed sensing. CS theory suggests that if a signal is sparse or compressive, the original signal can be reconstructed by exploiting a few measured values, which are much less than the ones suggested by previously used theories such as Shannon's sampling theorem (SST). Candes et al. [13] demonstrated the rationale of CS theory from the mathematical perspective, i.e. the original signal could be precisely reconstructed by utilizing a small portion of Fourier transformation coefficients. Baraniuk [12] provided a concrete analysis of compressed sensing and presented a specific interpretation of some solutions of different signal reconstruction algorithms. All these literature [11-17] laid the foundation of CS theory and provided the theoretical basis for future research. Thus, a large number of algorithms based on CS theory have been proposed to address different problems in various fields. Moreover, CS theory always includes three basic components: sparse representation, encoding measuring, and the reconstructing algorithm. As an indispensable prerequisite of CS theory, the sparse representation theory [4, 7-10, 17] is the most outstanding technique used to conquer difficulties that appear in many fields. For example, the methodology of sparse representation is a novel signal sampling method for the sparse or compressible signal and has been successfully applied to signal processing [4-6].

Sparse representation has attracted much attention in recent years and many examples in different fields can be found where sparse representation is definitely beneficial and favorable [18, 19]. One example is image classification, where the basic goal is to classify a given test image into several predefined categories. It has been demonstrated that natural images can be sparsely represented from the perspective of the properties of visual neurons. The sparse representation based classification (SRC) method [20] first assumes that the test sample can be sufficiently represented by samples from the same subject. Specifically, SRC exploits the linear combination of training samples to represent the test sample and computes sparse representation coefficients of the linear representation system, and then calculates the reconstruction residuals of each class employing the sparse representation coefficients and training samples.
The test sample will be of the way of exploiting the “atoms”: the holistic represen- classified as a member of the class, which leads to the tation based method and local representation based method minimum reconstruction residual. The literature [20] has also [21]. More specifically, holistic representation based methods demonstrated that the SRC method has great superiorities exploit training samples of all classes to represent the test sam- when addressing the image classification issue on corrupted ple, whereas local representation based methods only employ or disguised images. In such cases, each natural image can be training samples (or atoms) of each class or several classes sparsely represented and the sparse representation theory can to represent the test sample. Most of the sparse representation be utilized to fulfill the image classification task. methods are holistic representation based methods. A typical For , one important task is to extract key and representative local sparse representation methods is the components from a large number of clutter signals or groups of two-phase test sample sparse representation (TPTSR) method complex signals in coordination with different requirements. [9]. In consideration of different methodologies, the sparse Before the appearance of sparse representation, SST and representation method can be grouped into two aspects: pure Nyquist sampling law (NSL) were the traditional methods sparse representation and hybrid sparse representation, which for signal acquisition and the general procedures included improves the pre-existing sparse representation methods with sampling, coding compression, transmission, and decoding. the aid of other methods. The literature [22] suggests that Under the frameworks of SST and NSL, the greatest difficulty sparse representation algorithms roughly fall into three classes: of signal processing lies in efficient sampling from mass convex relaxation, greedy algorithms, and combinational meth- data with sufficient memory-saving. In such a case, sparse ods. In the literature [23, 24], from the perspective of sparse representation theory can simultaneously break the bottleneck problem modeling and problem solving, sparse decomposition of conventional sampling rules, i.e. SST and NSL, so that it has algorithms are generally divided into two sections: greedy a very wide application prospect. Sparse representation theory algorithms and convex relaxation algorithms. On the other proposes to integrate the processes of signal sampling and hand, if the viewpoint of optimization is taken into consid- coding compression. Especially, sparse representation theory eration, the problems of sparse representation can be divided employs a more efficient sampling rate to measure the original into four optimization problems: the smooth convex problem, sample by abandoning the pristine measurements of SST and nonsmooth nonconvex problem, smooth nonconvex problem, NSL, and then adopts an optimal reconstruction algorithm to and nonsmooth convex problem. Furthermore, Schmidt et al. reconstruct samples. In the context of , it [25] reviewed some optimization techniques for solving l1- is first assumed that all the signals are sparse or approximately norm regularization problems and roughly divided these ap- sparse enough [4, 6, 7]. 
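To make the SRC pipeline described above concrete — represent the test sample over all training samples, then compare class-wise reconstruction residuals — the following minimal NumPy sketch may help. It is illustrative only: a plain least-squares fit stands in for the l1 solver (so the coefficients are not truly sparse), and the function and variable names are not from the paper's released Matlab code.

```python
import numpy as np

def src_classify(X, labels, y, solve_sparse=None):
    """Toy SRC-style classifier: represent y over all training samples X
    (columns are samples), then assign the class with the smallest
    class-wise reconstruction residual ||y - X_c @ alpha_c||_2."""
    if solve_sparse is None:
        # Stand-in for an l1-minimization solver; replace with OMP, ISTA, etc.
        solve_sparse = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
    alpha = solve_sparse(X, y)
    residuals = {}
    for c in np.unique(labels):
        mask = (labels == c)                      # keep only class-c coefficients
        residuals[c] = np.linalg.norm(y - X[:, mask] @ alpha[mask])
    return min(residuals, key=residuals.get), residuals

# Example with random data: 3 classes, 5 training samples each, dimension 20.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 15))
labels = np.repeat(np.arange(3), 5)
y = X[:, 7] + 0.01 * rng.standard_normal(20)      # a noisy copy of a class-1 sample
print(src_classify(X, labels, y)[0])              # expected to print 1
```

In practice the stand-in solver would be replaced by any of the l0/l1 solvers surveyed in Sections IV-VII.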
Compared to the primary signal proaches into three optimization strategies: sub-gradient meth- space, the size of the set of possible signals can be largely ods, unconstrained approximation methods, and constrained decreased under the constraint of sparsity. Thus, massive optimization methods. The supplementary file attached with algorithms based on the sparse representation theory have the paper also offers more useful information to make fully been proposed to effectively tackle signal processing issues understandings of the ‘taxonomy’ of current sparse represen- such as signal reconstruction and recovery. To this end, the tation techniques in this paper. sparse representation technique can save a significant amount In this paper, the available sparse representation methods are of sampling time and sample storage space and it is favorable categorized into four groups, i.e. the greedy strategy approxi- and advantageous. mation, constrained optimization strategy, proximity algorithm based optimization strategy, and homotopy algorithm based sparse representation, with respect to the analytical solution A. Categorization of sparse representation techniques and optimization viewpoints. Sparse representation theory can be categorized from different (1) In the greedy strategy approximation for solving sparse viewpoints. Because different methods have their individual representation problem, the target task is mainly to solve motivations, ideas, and concerns, there are varieties of strate- the sparse representation method with l0-norm minimization. gies to separate the existing sparse representation methods Because of the fact that this problem is an NP-hard problem into different categories from the perspective of taxonomy. [26], the greedy strategy provides an approximate solution to For example, from the viewpoint of “atoms”, available sparse alleviate this difficulty. The greedy strategy searches for the representation methods can be categorized into two general best local optimal solution in each iteration with the goal of groups: naive sample based sparse representation and dictio- achieving the optimal holistic solution [27]. For the sparse nary learning based sparse representation. However, on the representation method, the greedy strategy approximation only basis of the availability of labels of “atoms”, sparse repre- chooses the most k appropriate samples, which are called k- sentation and learning methods can be coarsely divided into sparsity, to approximate the measurement vector. three groups: supervised learning, semi-supervised learning, (2) In the constrained optimization strategy, the core idea is and unsupervised learning methods. Because of the sparse con- to explore a suitable way to transform a non-differentiable op- straint, sparse representation methods can be divided into two timization problem into a differentiable optimization problem communities: structure constraint based sparse representation by replacing the l1-norm minimization term, which is convex and sparse constraint based sparse representation. Moreover, but nonsmooth, with a differentiable optimization term, which in the field of image classification, the representation based is convex and smooth. More specifically, the constrained op- classification methods consist of two main categories in terms timization strategy substitutes the l1-norm minimization term JOURNAL 3
[Fig. 1 (structure diagram) appears here. Its blocks are: Sparse Representation — Main Body: Fundamentals and General Frameworks (II-III); Representative Algorithms (IV-VII): Greedy Strategy Approximation (IV), Constrained Optimization (V), Proximity Algorithm-based Optimization (VI), Homotopy Algorithm-based Algorithms (VII); Extensive Applications (VIII): Dictionary Learning (Unsupervised Dictionary Learning, Supervised Dictionary Learning), Image Processing (Super-resolution, restoration, denoising), Image classification and visual tracking; Experimental Evaluation: Real-world Applications (IX); Conclusion (X).]
Fig. 1: The structure of this paper. The main body of this paper mainly consists of four parts: basic concepts and frameworks in Section II-III, representative algorithms in Section IV-VII and extensive applications in Section VIII, massive experimental evaluations in Section IX. Conclusion is summarized in Section X. with an equal constraint condition on the original unconstraint fields. Extensive state-of-art sparse representation methods problem. If the original unconstraint problem is reformulated are summarized and the ideas, algorithms, and wide applica- into a differentiable problem with constraint conditions, it will tions of sparse representation are comprehensively presented. become an uncomplicated problem in the consideration of the Specifically, there is concentration on introducing an up-to- fact that l1-norm minimization is global non-differentiable. date review of the existing literature and presenting some (3) Proximal algorithms can be treated as a powerful tool insights into the studies of the latest sparse representation for solving nonsmooth, constrained, large-scale, or distributed methods. Moreover, the existing sparse representation methods versions of the optimization problem [28]. In the proximity al- are divided into different categories. Subsequently, correspond- gorithm based optimization strategy for sparse representation, ing typical algorithms in different categories are presented the main task is to reformulate the original problem into the and their distinctness is explicitly shown. Finally, the wide specific model of the corresponding proximal operator such as applications of these sparse representation methods in different the soft thresholding operator, hard thresholding operator, and fields are introduced. resolvent operator, and then exploits the proximity algorithms The remainder of this paper is mainly composed of four to address the original sparse optimization problem. parts: basic concepts and frameworks are shown in Section II (4) The general framework of the homotopy algorithm is and Section III, representative algorithms are presented in Sec- to iteratively trace the final desired solution starting from the tion IV-VII and extensive applications are illustrated in Section initial point to the optimal point by successively adjusting the VIII, massive experimental evaluations are summarized in Sec- homotopy parameter [29]. In homotopy algorithm based sparse tion IX. More specifically, the fundamentals and preliminary representation, the homotopy algorithm is used to solve the l1- mathematic concepts are presented in Section II, and then the norm minimization problem with k-sparse property. general frameworks of the existing sparse representation with different norm regularizations are summarized in Section III. B. Motivation and objectives In Section IV, the greedy strategy approximation method is In this paper, a survey on sparse representation and overview presented for obtaining a sparse representation solution, and in available sparse representation algorithms from viewpoints of Section V, the constrained optimization strategy is introduced the mathematical and theoretical optimization is provided. This for solving the sparse representation issue. 
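Since the soft and hard thresholding operators mentioned in category (3) recur throughout Sections VI and VII, a two-line NumPy sketch of each (with illustrative names, not tied to any specific algorithm in the survey) is given here for reference.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam*||x||_1: shrink each entry toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def hard_threshold(x, lam):
    """Keep entries whose magnitude exceeds lam, zero out the rest
    (the operator associated with l0-style penalties)."""
    out = x.copy()
    out[np.abs(out) <= lam] = 0.0
    return out

x = np.array([-1.5, -0.2, 0.0, 0.4, 2.0])
print(soft_threshold(x, 0.5))   # [-1.  -0.   0.   0.   1.5]
print(hard_threshold(x, 0.5))   # [-1.5  0.   0.   0.   2. ]
```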
Furthermore, the paper is designed to provide foundations of the study on sparse proximity algorithm based optimization strategy and Homo- representation and aims to give a good start to newcomers in topy strategy for addressing the sparse representation problem computer vision and pattern recognition communities, who are are outlined in Section VI and Section VII, respectively. Sec- interested in sparse representation methodology and its related tion VIII presents extensive applications of sparse represen- JOURNAL 4
Fig. 2: Geometric interpretations of different norms in 2-D space [7]. (a), (b), (c), (d) are the unit balls of the l0-norm, l1-norm, l2-norm and lp-norm (0 < p < 1), respectively.
tation in widespread and prevalent fields including dictionary Suppose that v = [v1, v2, , vn] is an n dimensional learning methods and real-world applications. Finally, Section vector in Euclidean space, thus··· IX offers massive experimental evaluations and conclusions n are drawn and summarized in Section X. The structure of the v = ( v p)1/p (II.3) k kp | i| this paper has been summarized in Fig. 1. i=1 X is denoted as the p-norm or the l -norm (1 p ) of p ≤ ≤ ∞ II. FUNDAMENTALS AND PRELIMINARY CONCEPTS vector v. When p=1, it is called the l1-norm. It means the sum of A. Notations absolute values of the elements in vector v , and its geometric In this paper, vectors are denoted by lowercase letters with interpretation is shown in Fig. 2b, which is a square with a bold face, e.g. x. Matrices are denoted by uppercase letter, forty-five degree rotation. e.g. X and their elements are denoted with indexes such as When p=2, it is called the l2-norm or Euclidean norm. It is X . In this paper, all the data are only real-valued. defined as v = (v2 + v2 + + v2 )1/2, and its geometric i k k2 1 2 ··· n Suppose that the sample is from space Rd and thus all the interpretation in 2-D space is shown in Fig. 2c which is a samples are concatenated to form a , denoted as D circle. d n ∈ R × . If any sample can be approximately represented by In the literature, the sparsity of a vector v is always related a linear combination of dictionary D and the number of the to the so-called l0-norm, which means the number of the samples is larger than the dimension of samples in D, i.e. n> nonzero elements of vector v. Actually, the l0-norm is the d, dictionary D is referred to as an over-complete dictionary. limit as p 0 of the lp-norms [8] and the definition of the → A signal is said to be compressible if it is a sparse signal in the l0-norm is formulated as original or transformed domain when there is no information n p p or energy loss during the process of transformation. v 0 = lim v p = lim vi (II.4) k k p 0 k k p 0 | | → → i=1 “sparse” or “sparsity” of a vector means that some ele- X ments of the vector are zero. We use a linear combination of We can see that the notion of the l0-norm is very convenient N N N 1 a basis matrix A R × to represent a signal x R × , i.e. and intuitive for defining the sparse representation problem. ∈ N 1 ∈ x = As where s R × is the column vector of weighting The property of the l -norm can also be presented from the ∈ 0 coefficients. If only k (k N) elements of s are nonzero perspective of geometric interpretation in 2-D space, which is ≪ and the rest elements in s are zero, we call the signal x is shown in Fig. 2a, and it is a crisscross. k-sparse. Furthermore, the geometric meaning of the lp-norm (00) on the parameter vector x, and then the set of real n dimensions, is defined as following function is obtained: n x y xT y x y x y x y (II.1) , = = 1 1 + 2 2 + + n n f(x)= x p = ( x p) (II.5) h i ··· k kp | i| m n i=1 The standard inner product of two matrixes, X R × and X m n ∈ Y R × from the set of real m n matrixes, is denoted The relationships between different norms are summarized as∈ the following equation × in Fig. 3. From the illustration in Fig. 3, the conclusions are as follows. The l0-norm function is a nonconvex, nonsmooth, m n discontinuity, global nondifferentiable function. The l -norm X, Y = tr(XT Y )= X Y (II.2) p h i ij ij (0
[Fig. 4 appears here: each panel shows the constraint line y = Ax together with the corresponding norm ball ||x||0, ||x||1, or ||x||2.]
Fig. 4: The geometry of the solutions of different norm regularization in 2-D space [7]. (a), (b) and (c) are the geometry of the solutions of the l0-norm, l1-norm and l2-norm minimization, respectively.
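The geometric message of Fig. 4 can be checked numerically: for an underdetermined system y = Xα that admits an exactly sparse solution, the minimum-l2-norm solution returned by the pseudo-inverse is generally dense. The sketch below (sizes and seeds are arbitrary, chosen only for illustration) prints the l0, l1 and l2 measures of both solutions.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((10, 30))          # underdetermined: 10 equations, 30 unknowns
a_sparse = np.zeros(30)
a_sparse[[4, 21]] = [1.0, -2.0]
y = X @ a_sparse                            # y is exactly representable by 2 atoms

a_l2 = np.linalg.pinv(X) @ y                # minimum l2-norm solution of y = X a

for name, a in [("2-sparse solution", a_sparse), ("min l2-norm solution", a_l2)]:
    print(name,
          "l0 =", np.count_nonzero(np.abs(a) > 1e-10),
          "l1 =", round(np.abs(a).sum(), 3),
          "l2 =", round(np.linalg.norm(a), 3))
# The pseudo-inverse solution typically has all 30 entries nonzero (not sparse),
# even though its l2 norm is the smallest among all solutions of y = X a.
```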

condition of sparsity. However, it has been demonstrated that L norm 0  the representation solution of the l2-norm minimization is not L norm 1/10 strictly sparse enough but “limitedly-sparse”, which means it L norm 1/3 f()| x= x | p possesses the capability of discriminability [30]. i i L norm m n 1/2 The Frobenius norm, L1-norm of matrix X R × , and L norm 2/3 ∈ l2-norm or spectral norm are respectively defined as L norm 1 n m m L norm  2 X = ( X2 )1/2, X = max x , k kF j,i k kL1 j=1,...,n | ij | i=1 j=1 i=1 X X X T 1/2 X 2 = δmax(X) = (λmax(X X)) k k (II.6)

where δ is the singular value operator and the l2-norm of X is its maximum singular value [31]. The l2,1-norm or R1-norm is defined on matrix term, that  is xi n m X = ( X2 )1/2 (II.7) k k2,1 j,i i=1 j=1 Fig. 3: Geometric interpretations of different norms in 1-D X X space [7]. As shown above, a norm can be viewed as a measure of the length of a vector v. The distance between two vectors x and y, or matrices X and Y , can be measured by the length of their differences, i.e. In order to more specifically elucidate the meaning and dist(x, y)= x y 2, dist(X, Y )= X Y (II.8) solutions of different norm minimizations, the geometry in k − k2 k − kF 2-D space is used to explicitly illustrate the solutions of the which are denoted as the distance between x and y in the l -norm minimization in Fig. 4a, l -norm minimization in Fig. 0 1 context of the l2-norm and the distance between X and Y in 4b, and l -norm minimization in Fig. 4c. Let S = x∗ : 2 { the context of the Frobenius norm, respectively. Ax = y denote the line in 2-D space and a hyperplane Rm n } Assume that X × and the rank of X, i.e. rank(X)= will be formulated in higher dimensions. All possible solution r. The SVD of X∈is computed as x∗ must lie on the line of S. In order to visualize how T to obtain the solution of different norm-based minimization X = UΛV (II.9) problems, we take the l1-norm minimization problem as an m r T n r where U R × with U U = I and V R × with example to explicitly interpret. Suppose that we inflate the l1- V T V = I∈. The columns of U and V are called∈ left and ball from an original status until it hits the hyperplane S at right singular vectors of X, respectively. Additionally, Λ is some point. Thus, the solution of the l1-norm minimization a diagonal matrix and its elements are composed of the problem is the aforementioned touched point. If the sparse singular values of X, i.e. Λ = diag(λ1, λ2, , λr) with solution of the linear system is localized on the coordinate ··· λ1 λ2 λr > 0. Furthermore, the singular value axis, it will be sparse enough. From the perspective of Fig. decomposition≥ ≥ ··· can ≥ be rewritten as 4, it can be seen that the solutions of both the l -norm 0 r and l -norm minimization are sparse, whereas for the l - 1 2 X = λ u v (II.10) norm minimization, it is very difficult to rigidly satisfy the i i i i=1 X JOURNAL 6

where λi, ui and vi are the i-th singular value, the i-th column be zero or very close to zero and few of the entries in the of U, and the i-th column of V , respectively [31]. representation solution are differentially large. The sparsest representation solution can be acquired by III. SPARSE REPRESENTATION PROBLEM WITH DIFFERENT solving the linear representation system III.2 with the l0- NORM REGULARIZATIONS norm minimization constraint [52]. Thus problem III.2 can be converted to the following optimization problem: In this section, sparse representation is summarized and grouped into different categories in terms of the norm reg- αˆ = arg min α 0 s.t. y = Xα (III.3) ularizations used. The general framework of sparse represen- k k where refers to the number of nonzero elements in the tation is to exploit the linear combination of some samples k·k0 or “atoms” to represent the probe sample, to calculate the vector and is also viewed as the measure of sparsity. Moreover, representation solution, i.e. the representation coefficients of if just k (k

2 y = x1α1 + x2α2 + + xnαn (III.1) αˆ = L(α, λ) = arg min y Xα + λ α (III.8) ··· k − k2 k k0 where α (i=1,2, ,n) is the coefficient of x and Eq. III.1 where λ refers to the Lagrange multiplier associated with i ··· i can be rewritten into the following equation for convenient α 0. description: k k y = Xα (III.2) B. Sparse representation with l1-norm minimization T where matrix X=[x1, x2, , xn] and α=[α1, α2, , αn] . The l1-norm originates from the problem [41, 42] and However, problem III.2··· is an underdetermined linear··· system it has been extensively used to address issues in machine learn- of equations and the main problem is how to solve it. From ing, pattern recognition, and statistics [53–55]. Although the the viewpoint of linear algebra, if there is not any prior sparse representation method with l0-norm minimization can knowledge or any constraint imposed on the representation obtain the fundamental sparse solution of α over the matrix X, solution α, problem III.2 is an ill-posed problem and will the problem is still a non-deterministic polynomial-time hard never have a unique solution. That is, it is impossible to (NP-hard) problem and the solution is difficult to approximate utilize equation III.2 to uniquely represent the probe sample y [26]. Recent literature [20, 56–58] has demonstrated that when using the measurement matrix X. To alleviate this difficulty, the representation solution obtained by using the l1-norm it is feasible to impose an appropriate regularizer constraint or minimization constraint is also content with the condition of regularizer function on representation solution α. The sparse sparsity and the solution using l1-norm minimization with representation method demands that the obtained represen- sufficient sparsity can be equivalent to the solution obtained tation solution should be sparse. Hereafter, the meaning of by l0-norm minimization with full probability. Moreover, the ‘sparse’ or ‘sparsity’ refers to the condition that when the l1-norm optimization problem has an analytical solution and linear combination of measurement matrix is exploited to can be solved in polynomial time. Thus, extensive sparse represent the probe sample, many of the coefficients should representation methods with the l1-norm minimization have JOURNAL 7

been proposed to enrich the sparse representation theory. On the other hand, the l2,1-norm is also called the rotation The applications of sparse representation with the l1-norm invariant l1-norm, which is proposed to overcome the difficulty minimization are extraordinarily and remarkably widespread. of robustness to outliers [62]. The objective function of the Correspondingly, the main popular structures of sparse rep- sparse representation problem with the l2,1-norm minimization resentation with the l1-norm minimization , similar to sparse is to solve the following problem: representation with l0-norm minimization, are generally used arg min Y XA 2,1 + µ A 2,1 (III.17) to solve the following problems: A k − k k k where Y = [y , y , , y ] refers to the matrix composed 1 2 ··· N αˆ = arg min α 1 s.t. y = Xα (III.9) of samples, A = [a1, a2, , aN ] is the corresponding α k k coefficient matrix of X, and···µ is a small positive constant. Sparse representation with the l2,1-norm minimization can 2 be implemented by exploiting the proposed algorithms in αˆ = arg min α 1 s.t. y Xα 2 ε (III.10) α k k k − k ≤ literature [45–47]. or IV. GREEDY STRATEGY APPROXIMATION 2 αˆ = arg min y Xα s.t. α 1 τ (III.11) α k − k2 k k ≤ Greedy algorithms date back to the 1950s. The core idea of the greedy strategy [7, 23] is to determine the position based on the relationship between the atom and probe sample, and 1 2 αˆ = L(α, λ) = arg min y Xα + λ α 1 (III.12) then to use the least square to evaluate the amplitude value. α 2k − k2 k k Greedy algorithms can obtain the local optimized solution in where λ and τ are both small positive constants. each step in order to address the problem. However, the greedy algorithm can always produce the global optimal solution or

C. Sparse representation with lp-norm (0 < p < 1) minimization

So y = R0, dl0 dl0 + R1 where R0, dl0 dl0 represents the 2 2 h i h i αˆ = arg min α s.t. y Xα ε (III.15) orthogonal projection of y onto dl0 , and R1 is the representa- α k k2 k − k2 ≤ tion residual by using dl0 to represent y. Considering the fact or that dl0 is orthogonal to R1, Eq. IV.2 can be rewritten as αˆ = L(α, λ) = arg min y Xα 2 + λ α 2 (III.16) y 2 = y, d 2 + R 2 (IV.3) α k − k2 k k2 k k |h l0 i| k 1k JOURNAL 8
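Among the formulations above, problem III.16 is ordinary l2-regularized least squares (ridge regression) and, unlike the l0 and l1 problems, it has the closed-form solution alpha = (X^T X + lambda*I)^(-1) X^T y. The short sketch below (illustrative names, not the paper's code) verifies the first-order optimality condition numerically; as the survey stresses, this solution is generally dense rather than sparse.

```python
import numpy as np

def l2_regularized_solution(X, y, lam):
    """Closed-form minimizer of ||y - X a||_2^2 + lam * ||a||_2^2 (problem III.16)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 10))
y = rng.standard_normal(30)
alpha = l2_regularized_solution(X, y, lam=0.1)

# Gradient of the objective at the minimizer should vanish: 2X^T(Xa - y) + 2*lam*a = 0.
grad = 2 * X.T @ (X @ alpha - y) + 2 * 0.1 * alpha
print(np.allclose(grad, 0.0, atol=1e-8))   # True
```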

To obtain the minimum representation residual, the MP al- C. Series of matching pursuit algorithms gorithm iteratively figures out the best matching atom from the It is an excellent choice to employ the greedy strategy to over-completed dictionary, and then utilizes the representation approximate the solution of sparse representation with the residual as the next approximation target until the termination l0-norm minimization. These algorithms are typical greedy condition of iteration is satisfied. For the t-th iteration, the best iterative algorithms. The earliest algorithms were the matching matching atom is d and the approximation result is found lt pursuit (MP) and orthogonal matching pursuit (OMP). The from the following equation: basic idea of the MP algorithm is to select the best matching atom from the overcomplete dictionary to construct sparse Rt = Rt, dlt dlt + Rt+1 (IV.4) h i approximation during each iteration, to compute the signal where the dlt satisfies the equation: representation residual, and then to choose the best matching atom till the stopping criterion of iteration is satisfied. Many R , d = sup R , d (IV.5) |h t lt i| |h t ii| more greedy algorithms based on the MP and OMP algorithm

Clearly, dlt is orthogonal to Rk+1, and then such as the efficient orthogonal matching pursuit algorithm [65] subsequently have been proposed to improve the pursuit R 2 = R , d 2 + R 2 (IV.6) k kk |h t lt i| k t+1k algorithm. Needell et al. proposed an regularized version of 2 For the n-th iteration, the representation residual Rn orthogonal matching pursuit (ROMP) algorithm [37], which τ where τ is a very small constant and the probek samplek ≤y recovered all k sparse signals based on the Restricted Isometry can be formulated as: Property of random frequency measurements, and then pro- n 1 posed another variant of OMP algorithm called compressive − y = R , d d + R (IV.7) sampling matching pursuit (CoSaMP) algorithm [66], which h j lj i lj n j=1 incorporated several existing ideas such as restricted isometry X property (RIP) and pruning technique into a greedy iterative If the representation residual is small enough, the probe structure of OMP. Some other algorithms also had an impres- sample y can approximately satisfy the following equation: n 1 sive influence on future research on CS. For example, Donoho y − R , d d where n N. Thus, the probe j=1 j lj lj et al. proposed an extension of OMP, called stage-wise orthog- sample≈ can beh representedi by a small≪ number of elements from P onal matching pursuit (StOMP) algorithm [67], which depicted a large dictionary. In the context of the specific representation an iterative algorithm with three main steps, i.e. threholding, error, the termination condition of sparse representation is that selecting and projecting. Dai and Milenkovic proposed a the representation residual is smaller than the presupposed new method for sparse signal reconstruction named subspace value. More detailed analysis on matching pursuit algorithms pursuit (SP) algorithm [68], which sampled signals satisfying can be found in the literature [63]. the constraints of the RIP with a constant parameter. Do et al. presented a sparsity adaptive matching pursuit (SAMP) B. Orthogonal matching pursuit algorithm algorithm [69], which borrowed the idea of the EM algorithm The orthogonal matching pursuit (OMP) algorithm [36, 64] to alternatively estimate the sparsity and support set. Jost et al. is an improvement of the MP algorithm. The OMP employs proposed a tree-based matching pursuit (TMP) algorithm [70], the process of orthogonalization to guarantee the orthogonal which constructed a tree structure and employed a structuring direction of projection in each iteration. It has been verified strategy to cluster similar signal atoms from a highly redundant that the OMP algorithm can be converged in limited iterations dictionary as a new dictionary. Subsequently, La and Do pro- [36]. The main steps of OMP algorithm have been summarized posed a new tree-based orthogonal matching pursuit (TBOMP) in Algorithm 1. algorithm [71], which treated the sparse tree representation as an additional prior knowledge for linear inverse systems Algorithm 1. Orthogonal matching pursuit algorithm by using a small number of samples. Recently, Karahanoglu Task: Approximate the constraint problem: and Erdogan conceived a forward-backward pursuit (FBP) αˆ = argminα α 0 s.t. y = Xα k k method [72] with two greedy stages, in which the forward Input: Probe sample y, measurement matrix X, sparse coefficients vector stage enlarged the support estimation and the backward stage α removed some unsatisfied atoms. 
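Because OMP (Algorithm 1) is the reference point for the whole family of pursuit algorithms listed above, a compact NumPy reimplementation of its loop — pick the atom most correlated with the current residual, refit by least squares on the selected atoms, update the residual — is sketched below. It follows the steps summarized in Algorithm 1 but is an illustrative rewrite, not the authors' code.

```python
import numpy as np

def omp(X, y, k, tol=1e-6):
    """Orthogonal matching pursuit: greedily select at most k columns of X
    to approximate y, refitting the coefficients by least squares each round."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    alpha = np.zeros(X.shape[1])
    for _ in range(k):
        if np.linalg.norm(residual) <= tol:
            break
        # Step 1: atom with the largest |<residual, x_j>| among unselected columns.
        correlations = np.abs(X.T @ residual)
        correlations[support] = -np.inf
        support.append(int(np.argmax(correlations)))
        # Steps 2-4: least-squares refit on the current support, then new residual.
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef
    alpha[support] = coef
    return alpha

# Recover a 3-sparse vector from noiseless random measurements y = X @ alpha_true.
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 100))
alpha_true = np.zeros(100)
alpha_true[[5, 17, 60]] = [1.0, -2.0, 0.5]
y = X @ alpha_true
print(np.allclose(omp(X, y, k=3), alpha_true, atol=1e-6))   # True (with high probability)
```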
More detailed treatments of Initialization: t = 1, r0 = y, α = 0, D0 = φ, index set Λ0 = φ where φ denotes empty set, τ is a small constant. the greedy pursuit for sparse representation can be found in While rt > τ do the literature [23]. Stepk 1:k Find the best matching sample, i.e. the biggest inner product between rt 1 and xj (j Λt 1) by exploiting − 6∈ − λt = arg maxj Λt−1 rt 1, xj . V. CONSTRAINED OPTIMIZATION STRATEGY 6∈ |h − i| Step 2: Update the index set Λt =Λt 1 λt and reconstruct data set − S Dt =[Dt 1, xλt ]. Constrained optimization strategy is always utilized to obtain Step 3: Compute the− sparse coefficient by using the least square algorithm the solution of sparse representation with the l1-norm reg- 2 α˜ = argmin y Dtα˜ . k − k2 ularization. The methods that address the non-differentiable Step 4: Update the representation residual using rt = y Dtα˜. Step 5: t = t + 1. − unconstrained problem will be presented by reformulating it End as a smooth differentiable constrained optimization problem. Output: D, α These methods exploit the constrained optimization method with efficient convergence to obtain the sparse solution. What JOURNAL 9 is more, the constrained optimization strategy emphasizes Problem V.6 can be addressed with the close-form solution the equivalent transformation of α 1 in problem III.12 and (gt)T (gt) k k σt = (V.8) employs the new reformulated constrained problem to obtain (gt)T A(gt) a sparse representation solution. Some typical methods that employ the constrained optimization strategy to solve the Furthermore, the basic GPSR algorithm employs the back- original unconstrained non-smooth problem are introduced in tracking linear search method [31] to ensure that the step size this section. of gradient descent, in each iteration, is a more proper value. The stop condition of the backtracking linear search should satisfy A. Gradient Projection Sparse Reconstruction zt t zt zt zt T The core idea of the gradient projection sparse representation G(( σ G( ))+) > G( ) β G( ) − ∇ t t −t ∇ t (V.9) method is to find the sparse representation solution along (z (z σ G(z ))+) − − ∇ with the gradient descent direction. The first key procedure of where β is a small constant. The main steps of GPSR are gradient projection sparse reconstruction (GPSR) [73] provides summarized in Algorithm 2. For more detailed information, a constrained formulation where each value of α can be split one can refer to the literature [73]. into its positive and negative parts. Vectors α+ and α are − introduced to denote the positive and negative coefficients of Algorithm 2. Gradient Projection Sparse Reconstruction (GPSR) α, respectively. The sparse representation solution α can be Task: To address the unconstraint problem: 1 2 αˆ = argminα y Xα + λ α 1 formulated as: 2 k − k2 k k Input: Probe sample y, the measurement matrix X, small constant λ Initialization: t = 0, β (0, 0.5), γ (0, 1), given α so that z = α = α+ α , α+ 0, α 0 (V.1) ∈ ∈ − − ≥ − ≥ [α+, α ]. While not− converged do where the operator ( )+ denotes the positive-part operator, t t t Step 1: Compute σ exploiting Eq. V.8 and σ mid(σ , σ , σmax), · 1T ← min which is defined as (x)+=max 0, x . Thus, α 1 = d α+ + where mid( , , ) denotes the middle value of the three parameters. 1T 1 { T } k k Step 2: While Eq.· V.9· · not satisfied d α , where d = [1, 1, , 1] is a d–dimensional vector − ··· do σt γσt end t+1 ←t t t d Step 3: z = (z σ G(z ))+ and t = t + 1. with d ones. 
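The variable splitting just introduced, alpha = alpha_plus - alpha_minus with the positive-part operator (x)+ = max{0, x}, is the mechanism that lets GPSR treat ||alpha||_1 as the linear term 1^T alpha_plus + 1^T alpha_minus. A small sketch (illustrative only, not GPSR itself) shows the split, the nonnegative projection applied after each gradient step, and the recovery of the l1 norm.

```python
import numpy as np

def split_signs(alpha):
    """Split alpha into its positive and negative parts, alpha = a_plus - a_minus."""
    a_plus = np.maximum(alpha, 0.0)
    a_minus = np.maximum(-alpha, 0.0)
    return a_plus, a_minus

def project_nonnegative(z):
    """The positive-part operator (z)_+ used by gradient projection methods."""
    return np.maximum(z, 0.0)

alpha = np.array([1.5, -0.3, 0.0, -2.0])
a_plus, a_minus = split_signs(alpha)
print(np.allclose(a_plus - a_minus, alpha))                  # True
print(a_plus.sum() + a_minus.sum(), np.abs(alpha).sum())     # both equal 3.8
```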
Accordingly, problem III.12 can be reformulated as a constrained quadratic problem:

1 2 arg min L(α) = arg min y X[α+ α ] 2+ 2k − − − k (V.2) 1T 1T λ( d α+ + d α ) s.t. α+ 0, α 0 B. Interior-point method based sparse representation strategy − ≥ − ≥ or The Interior-point method [31] is not an iterative algorithm but a smooth mathematic model and it always incorporates 1 2 arg min L(α) = arg min y [X+,X ][α+ α ] 2 the Newton method to efficiently solve unconstrained smooth 2k − − − − k 1T 1T problems of modest size [28]. When the Newton method is +λ( d α+ + d α ) s.t. α+ 0, α 0 − ≥ − ≥(V.3) used to address the optimization issue, a complex Newton equation should be solved iteratively which is very time- Furthermore, problem V.3 can be rewritten as: consuming. A method named the truncated Newton method 1 can effectively and efficiently obtain the solution of the arg min G(z)= cT z + zT Az s.t. z 0 (V.4) 2 ≥ Newton equation. A prominent algorithm called the truncated T T Newton based interior-point method (TNIPM) exists, which where z = [α+; α ], c = λ12d + [ X y; X y], − T − T can be utilized to solve the large-scale l1-regularized least 1 T X X X X 2d = [1, , 1] , A = T − T . squares (i.e. l l ) problem [74]. ··· X X X X 1 s 2d  −  The original problem of l1 ls is to solve problem III.12 and The GPSR algorithm employs the gradient descent and the core procedures of l1 ls are shown below: standard| line-search{z } method [31] to address problem V.4. The (1) Transform the original unconstrained non-smooth problem value of z can be iteratively obtained by utilizing to a constrained smooth optimization problem. arg min zt+1 = zt σ G(zt) (V.5) (2) Apply the interior-point method to reformulate the con- − ∇ strained smooth optimization problem as a new unconstrained where the gradient of G(zt) = c + Azt and σ is the step smooth optimization problem. ∇ size of the iteration. For step size σ, GPSR updates the step (3) Employ the truncated Newton method to solve this uncon- size by using strained smooth problem. The main idea of the l l will be briefly described. σt = arg min G(zt σgt) (V.6) 1 s σ − For simplicity of presentation, the following one-dimensional where the function gt is pre-defined as problem is used as an example. α = arg min σ (V.10) | | σ α σ ( G(zt)) , if zt > 0 or ( G(zt)) < 0 − ≤ ≤ gt = i i i (V.7) i 0∇, otherwise. ∇ where σ is a proper positive constant.  JOURNAL 10
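The bound-variable device used in the one-dimensional example V.10 — and developed for the full problem in the equations that follow — replaces each |alpha_i| by a variable sigma_i with -sigma_i <= alpha_i <= sigma_i. Applied to the equality-constrained problem III.9, the same device turns l1 minimization into a linear program; the sketch below solves it with SciPy's generic LP solver (assuming SciPy is available) and is only meant to illustrate the reformulation, not the truncated-Newton interior-point method of this section.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit_lp(X, y):
    """Solve min ||a||_1  s.t.  X a = y  (problem III.9) as a linear program,
    using bound variables sigma with |a_i| <= sigma_i."""
    n, d = X.shape
    # Decision variables: [a (d entries), sigma (d entries)]; minimize sum(sigma).
    c = np.concatenate([np.zeros(d), np.ones(d)])
    I = np.eye(d)
    A_ub = np.block([[I, -I], [-I, -I]])          #  a - sigma <= 0,  -a - sigma <= 0
    b_ub = np.zeros(2 * d)
    A_eq = np.hstack([X, np.zeros((n, d))])       #  X a = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * d + [(0, None)] * d, method="highs")
    return res.x[:d]

rng = np.random.default_rng(5)
X = rng.standard_normal((25, 60))
a_true = np.zeros(60)
a_true[[7, 33]] = [1.0, -2.0]
y = X @ a_true
print(np.flatnonzero(np.abs(basis_pursuit_lp(X, y)) > 1e-6))  # expected: [ 7 33]
```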

Thus, problem III.12 can be rewritten as c) The duality gap is constructed, which is the gap between

1 2 the primary problem and the dual problem: αˆ = arg min 2 y Xα 2 + λ α 1 1 k − k2 k Nk = arg min 2 y Xα 2 + λ i=1 min σi αi σi σi g = y Xα + λ α 1 F (u) (V.19) 1 k − k2 − ≤ ≤N k − k k k − = arg min y Xα 2 + λ min σi αi σi σi 2 − ≤ ≤ i=1 k − k1 P 2 N Third, the method of backtracking linear search is used to = arg min σi αi σi 2 y Xα 2 + λ i=1 σi − ≤ ≤ k − k P (V.11) determine an optimal step size of the Newton linear search. Thus problem III.12 is also equivalent to solveP the following The stopping condition of the backtracking linear search is problem: G(α+ηt α, σ+ηt σ) > G(α, σ)+ρηt G(α, σ)[ α, σ] △ △ ∇ △ △ N (V.20) 1 2 t αˆ = arg min y Xα 2+λ σi s.t. σi αi σi where ρ (0, 0.5) and η (0, 1) is the step size of the α,σ RN 2k − k − ≤ ≤ ∈ ∈ ∈ i=1 Newton linear search. X (V.12) Finally, the termination condition of the Newton linear search or is set to N (V.21) 1 2 ζ = min 0.1,βg/ h 2 αˆ = arg min y Xα 2 + λ σi { k k } α,σ RN 2k − k (V.13) ∈ i=1 where the function α σ , is a small constant, X h = G( , ) β s.t. σ + α 0, σ α 0 and g is the duality gap.∇ The main steps of algorithm l l i i ≥ i − i ≥ 1 s are summarized in Algorithm 3. For further description and The interior-point strategy can be used to transform problem analyses, please refer to the literature [74]. V.13 into an unconstrained smooth problem

v N Algorithm 3. Truncated Newton based interior-point method (TNIPM) for 2 l ls αˆ = arg min G(α, σ)= y Xα 2+λv σi B(α, σ) 1 α,σ RN 2k − k − Task: To address the unconstraint problem: ∈ i=1 1 2 X αˆ = argminα y Xα + λ α 1 (V.14) 2 k − k2 k k N N where B(α, σ)= i=1 log(σi + αi)+ i=1 log(σi αi) is Input: Probe sample y, the measurement matrix X, small constant λ − Initialization: t = 1, v = 1 , ρ (0, 0.5), σ = 1 a barrier function, which forces the algorithm to be performed λ ∈ N within the feasibleP region in the contextP of unconstrained Step 1: Employ preconditioned conjugate gradient algorithm to obtain the approximation of H in Eq. V.15, and then obtain the descent direction condition. of linear search [ αt, σt]. Subsequently, utilizes the truncated Newton method Step 2: Exploit△ the algorithm△ of backtracking linear search to find the l1 ls t to solve problem V.14. The main procedures of addressing optimal step size of Newton linear search η , which satisfies the Eq. V.20. Step 3: Update the iteration point utilizing (αt+1, σt+1) = (αt, σt)+ problem V.14 are presented as follows: ( αt + σt). First, the Newton system is constructed △Step 4:△ Construct feasible point using eq. V.18 and duality gap in Eq. V.19, and compute the termination tolerance ζ in Eq. V.21. α Step 5: If the condition g/F (u) > ζ is satisfied, stop; Otherwise, return H △ = G(α, σ) R2N (V.15) to step 1, update v in Eq. V.14 and t = t + 1. σ −∇ ∈  △  Output: α 2 2N 2N where H = G(α, σ) R × is the Hessian ma- trix, which is−∇ computed using∈ the preconditioned conjugate The truncated Newton based interior-point method (TNIPM) gradient algorithm, and then the direction of linear search [75] is a very effective method to solve the l1-norm regu- [ α, σ] is obtained. larization problems. Koh et al. [76] also utilized the TNIPM Second,△ △ the Lagrange dual of problem III.12 is used to to solve large scale logistic regression problems, which em- construct the dual feasible point and duality gap: ployed a preconditioned conjugate gradient method to compute a) The Lagrangian function and Lagrange dual of problem the search step size with warm-start techniques. Mehrotra III.12 are constructed. The Lagrangian function is reformu- proposed to exploit the interior-point method to address the lated as primal-dual problem [77] and introduced the second-order derivation of Taylor polynomial to approximate a primal-dual T L(α, z, u)= z z + λ α 1 + u(Xα y z) (V.16) trajectory. More analyses of interior-point method for sparse k k − − representation can be found in the literature [78]. where its corresponding Lagrange dual function is 1 αˆ = argmax F (u)= uT u uT y s.t. C. Alternating direction method (ADM) based sparse repre- −4 − (V.17) T sentation strategy (X u)i λi (i =1, 2, ,N) | |≤ ··· This section shows how the ADM [43] is used to solve primal b) A dual feasible point is constructed and dual problems in III.12. First, an auxiliary variable is T introduced to convert problem in III.12 into a constrained u =2s(y Xα), s = min λ/ 2yi 2(X Xα)i i (V.18) − { | − |}∀ problem with the form of problem V.22. Subsequently, the where u is a dual feasible point and s is the step size of the alternative direction method is used to efficiently address the linear search. sub-problems of problem V.22. By introducing the auxiliary JOURNAL 11 term s Rd, problem III.12 is equivalent to a constrained where soft(σ, η)= sign(σ)max σ η, 0 . problem∈ Finally, the Lagrange multiplier vector{| |−λ is} updated by using 1 Eq. V.24(c). arg min s 2 + α 1 s.t. 
s = y Xα (V.22) The algorithm presented above utilizes the second order α,s 2τ k k k k − Taylor expansion to approximately solve the sub-problem V.27 The optimization problem of the augmented Lagrangian func- and thus the algorithm is denoted as an inexact ADM or tion of problem V.22 is considered approximate ADM. The main procedures of the inexact ADM 1 T based sparse representation method are summarized in Algo- arg min L(α, s, λ)= s 2 + α 1 λ (s+ α,s,λ 2τ k k k k − (V.23) rithm 4. More specifically, the inexact ADM described above µ 2 is to reformulate the unconstrained problem as a constrained Xα y)+ s + Xα y 2 − 2 k − k problem, and then utilizes the alternative strategy to effectively where λ Rd is a Lagrange multiplier vector and µ is a address the corresponding sub-optimization problem. More- ∈ penalty parameter. The general framework of ADM is used to over, ADM can also efficiently solve the dual problems of solve problem V.23 as follows: the primal problems III.9-III.12. For more information, please refer to the literature [43, 79]. st+1 = arg min L(s, αt, λt) (a) αt+1 = arg min L(st+1, α, λt) (b) (V.24) Algorithm 4. Alternating direction method (ADM) based sparse represen-  tation strategy λt+1 = λt µ(st+1 + Xαt+1 y) (c)  − − Task: To address the unconstraint problem: αˆ = argmin 1 y Xα 2 + τ α First, the first optimization problem V.24(a) is considered α 2 2 1  k − k k k Input: Probe sample y, the measurement matrix X, small constant λ t t 1 t t T t 0 0 0 arg min L(s, α , λ )= s 2 + α 1 (λ ) (s + Xα Initialization: t = 0, s = 0, α = 0, λ = 0, τ = 1.01, µ is a small 2τ k k k k − constant. µ y)+ s + Xαt y 2 Step 1: Construct the constraint optimization problem of problem III.12 by − 2 k − k2 introducing the auxiliary parameter and its augmented Lagrangian function, i.e. problem (V.22) and (V.23). 1 t T µ t 2 = s 2 (λ ) s + s + Xα y 2+ While not converged do 2τ k k − 2 k − k Step 2: Update the value of the st+1 by using Eq. (V.25). t t T t α 1 (λ ) (Xα y) Step 2: Update the value of the αt+1 by using Eq. (V.29). k k − − (V.25) Step 3: Update the value of the λt+1 by using Eq. (V.24(c)). Step 4: µt+1 = τµt and t = t + 1. Then, it is known that the solution of problem V.25 with End While respect to s is given by Output: αt+1 τ st+1 = (λt µ(y Xαt)) (V.26) 1+ µτ − − Second, the optimization problem V.24(b) is considered VI. PROXIMITY ALGORITHM BASED OPTIMIZATION STRATEGY t+1 t 1 t+1 T t+1 arg min L(s , α, λ )= s 2 + α 1 (λ) (s In this section, the methods that exploit the proximity algo- 2τ k k k k − µ rithm to solve constrained convex optimization problems are + Xα y)+ st+1 + Xα y 2 − 2 k − k2 discussed. The core idea of the proximity algorithm is to utilize which is equivalent to the proximal operator to iteratively solve the sub-problem, which is much more computationally efficient than the original µ problem. The proximity algorithm is frequently employed to arg min α (λt)T (st+1 + Xα y)+ st+1+ {k k1 − − 2 k solve nonsmooth, constrained convex optimization problems 2 Xα y 2 [28]. Furthermore, the general problem of sparse represen- µ − k } tation with l -norm regularization is a nonsmooth convex = α + st+1 + Xα y λt/µ 2 1 k k1 2 k − − k2 optimization problem, which can be effectively addressed by = α 1 + f(α) using the proximal algorithm. k k (V.27) Suppose a simple constrained optimization problem is µ t+1 t 2 where f(α)= 2 s +Xα y λ /µ 2. If the second order min h(x) x χ (VI.1) Taylor expansionk is used to− approximate− k f(α), the problem { | ∈ } where χ Rn. 
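Before moving on, the alternating pattern of Algorithm 4 — a quadratic update, an l1 (soft-thresholding) update, and a multiplier update — can be summarized in code. The sketch below uses the standard ADMM splitting alpha = z for the same objective; note that Algorithm 4 instead splits on the residual s = y - X*alpha, so this is a closely related illustration rather than a line-by-line transcription.

```python
import numpy as np

def admm_lasso(X, y, lam, rho=1.0, n_iter=200):
    """Standard ADMM for min 0.5*||X a - y||_2^2 + lam*||a||_1.
    Splits a = z, whereas Algorithm 4 in the survey splits on s = y - X a;
    both follow the same alternating pattern of updates."""
    n, d = X.shape
    z = np.zeros(d)
    u = np.zeros(d)
    A = X.T @ X + rho * np.eye(d)                 # matrix used by the quadratic update
    Xty = X.T @ y
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    for _ in range(n_iter):
        alpha = np.linalg.solve(A, Xty + rho * (z - u))   # quadratic subproblem
        z = soft(alpha + u, lam / rho)                     # l1 subproblem (soft threshold)
        u = u + alpha - z                                  # dual (multiplier) update
    return z

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 120))
a_true = np.zeros(120)
a_true[[3, 40, 77]] = [1.0, -1.5, 2.0]
y = X @ a_true + 0.01 * rng.standard_normal(50)
a_hat = admm_lasso(X, y, lam=0.1)
print(np.flatnonzero(np.abs(a_hat) > 0.1))   # expected to be close to [3, 40, 77]
```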
The general framework of addressing the con- V.27 can be approximately reformulated as strained convex⊂ optimization problem VI.1 using the proximal t T T t+1 t t arg min α 1 + (α α ) X (s + Xα y λ /µ) algorithm can be reformulated as {k k − − − 1 t 2 t τ t 2 + α α 2 x˜ = arg min h(x)+ x x x χ (VI.2) 2τ k − k } { 2 k − k | ∈ } (V.28) where τ and xt are given. For definiteness and without loss where τ is a proximal parameter. The solution of problem V.28 of generality, it is assumed that there is the following linear can be obtained by the soft thresholding operator constrained convex optimization problem t+1 t T t+1 t t τ α = soft α τX (s +Xα y λ /µ), (V.29) arg min F (x)+ G(x) x χ (VI.3) { − − − µ} { | ∈ } JOURNAL 12

The solution of problem VI.3 obtained by employing the First, Taylor expansion is used to approximate f(α) = 1 2 t proximity algorithm is: 2 Xα y 2 at a point of α . The second order Taylor expansionk − isk 1 xt+1 x xt x xt x xt 2 = arg min F ( )+ G( ), + 1 { h∇ − i 2τ k − k } f(α)= f(αt) + (α αt)T f(αt)+ (α αt)T 1 t 2 − ∇ 2 − (VI.9) = arg min F (x)+ x θ t t { 2τ k − k } Hf (α )(α α )+ (VI.4) − ··· t t where Hf (α ) is the Hessian matrix of f(α) at α . For the where θ = xt τ G(xt). More specifically, for the sparse T T − ∇ function f(α), f(α)= X (Xα y) and Hf (α)= X X representation problem with l1-norm regularization, the main can be obtained.∇ − problem can be reformulated as:

min P (α)= λ α 1 Aα = y 1 t 2 t T T t { k k | } (VI.5) f(α)= Xα y 2 + (α α ) X (Xα y)+ or min P (α)= λ α + Aα y 2 α Rn 2k − k − − { k k1 k − k2 | ∈ } 1 (α αt)T XT X(α αt) which are considered as the constrained sparse representation 2 − − of problem III.12. (VI.10)

If the Hessian matrix Hf (α) is replaced or approximated in 1 A. Soft thresholding or shrinkage operator the third term above by using a scalar τ I, and then First, a simple form of problem III.12 is introduced, which 1 f(α) Xαt y 2 + (α αt)T XT (Xαt y) has a closed-form solution, and it is formulated as: ≈ 2k − k2 − − 1 t T t t 1 2 + (α α ) (α α )= Qt(α, α ) α∗ = min h(α)=λ α 1 + α s 2τ − − α k k 2k − k (VI.11) N N 1 (VI.6) = λ α + (α s )2 Thus problem VI.8 using the proximal algorithm can be | j | 2 j − j j=1 j=1 successively addressed by X X t+1 t where α∗ is the optimal solution of problem VI.6, and then α = arg min Qt(α, α )+ λ α 1 (VI.12) there are the following conclusions: k k (1) if α > 0, then h(α)= λα+ 1 α s 2 and its derivative Problem VI.12 is reformulated to a simple form of problem j 2 k − k is h′(α )= λ + α∗ s . VI.6 by j j − j Let h′(αj )=0 αj∗ = sj λ, where it indicates sj > λ; 1 ⇒ − 1 t t 2 t T T t (2) if α < 0, then h(α)= λα+ α s 2 and its derivative Qt(α, α )= Xα y 2+(α α ) X (Xα y)+ j − 2 k − k 2k − k − − is h′(αj )= λ + α∗ sj . 1 − j − α αt 2 Let h′(α )=0 α∗ = s + λ, where it indicates s < λ; 2 j ⇒ j j j − 2τ k − k (3) if λ sj λ, and then α∗ =0. 1 1 − ≤ ≤ j = Xαt y 2 + α αt + τXT (Xαt y) 2 So the solution of problem VI.6 is summarized as 2k − k2 2τ k − − k2 τ XT (Xαt y) 2 − 2 k − k2 sj λ, if sj > λ − 1 t T t 2 t α∗ = s + λ, if s < λ (VI.7) = α (α τX (Xα y)) 2 + B(α ) j  j j − 2τ k − − − k  0, otherwise (VI.13)

The equivalent expression of the solution is α∗ = t 1 t 2 τ T t 2  where the term B(α )= 2 Xα y 2 2 X (Xα y) shrink(s, λ), where the j-th component of shrink(s, λ) is in problem VI.12 is a constantk with− respectk − k to variable −α, andk shrink(s, λ)j = sign(sj)max sj λ, 0 . The operator it can be omitted. As a result, problem VI.12 is equivalent to {| | − } shrink( ) can be regarded as a proximal operator. the following problem: • 1 αt+1 = arg min α θ(αt) 2 + λ α (VI.14) B. Iterative shrinkage thresholding algorithm (ISTA) 2τ k − k2 k k1 The objective function of ISTA [80] has the form of where θ(αt)= αt τXT (Xαt y). The solution of− the simple problem− VI.6 is applied to 1 solve problem VI.14 where the parameter t is replaced by 2 t arg min F (α)= Xα y 2 + λ α 1 = f(α)+ λg(α) the equation θ(α ), and the solution of problem VI.14 is 2k − k k k (VI.8) αt+1 = shrink(θ(αt), λτ). Thus, the solution of ISTA is and is usually difficult to solve. Problem VI.8 can be con- reached. The techniques used here are called linearization or verted to the form of an easy problem VI.6 and the explicit preconditioning and more detailed information can be found procedures are presented as follows. in the literature [80, 81]. JOURNAL 13
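Combining the shrinkage operator of VI.7 with the update VI.14 gives the whole of ISTA in a few lines. The sketch below is an illustrative reimplementation with a fixed step size tau = 1/||X||_2^2 (a Lipschitz-constant choice in the spirit of the discussion that follows for FISTA); variable names are not from the paper's code.

```python
import numpy as np

def ista(X, y, lam, n_iter=1000):
    """ISTA for min 0.5*||X a - y||_2^2 + lam*||a||_1:
    a gradient step on the smooth term followed by soft thresholding."""
    # Step size 1/L, where L = ||X||_2^2 is the Lipschitz constant of the gradient.
    tau = 1.0 / np.linalg.norm(X, 2) ** 2
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    alpha = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta = alpha - tau * (X.T @ (X @ alpha - y))   # gradient step, theta(alpha)
        alpha = soft(theta, lam * tau)                   # proximal (shrinkage) step
    return alpha

rng = np.random.default_rng(4)
X = rng.standard_normal((60, 150))
a_true = np.zeros(150)
a_true[[10, 25, 90, 140]] = [2.0, -1.0, 1.5, -2.5]
y = X @ a_true + 0.01 * rng.standard_normal(60)
a_hat = ista(X, y, lam=0.1)
print(np.flatnonzero(np.abs(a_hat) > 0.2))   # expected to be close to [10, 25, 90, 140]
```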

C. Fast Iterative shrinkage thresholding algorithm (FISTA) more reliable approximation of Hf (α) in problem VI.9 using The fast iterative shrinkage thresholding algorithm (FISTA) is the Barzilai-Borwein (BB) spectral method [85]. The worm- an improvement of ISTA. FISTA [82] not only preserves the starting technique and BB spectral approach are introduced as efficiency of the original ISTA but also promotes the effective- follows. ness of ISTA so that FISTA can obtain global convergence. (1) Utilizing the worm-starting technique to optimize λ Considering that the Hessian matrix Hf (α) is approximated The values of λ in the sparse representation methods dis- 1 cussed above are always set to be a specific small constant. by using a scalar τ I for ISTA in Eq. VI.9, FISTA utilizes the minimum Lipschitz constant of the gradient f(α) to approx- However, Hale et al. [86] concluded that the technique that ∇ T imate the Hessian matrix of f(α), i.e. L(f)=2λmax(X X). exploits a decreasing value of λ from a warm-starting point Thus, the problem VI.8 can be converted to the problem below: can more efficiently solve the sub-problem VI.14 than ISTA 1 that is a fixed point iteration scheme. SpaRSA uses an adaptive f(α) Xαt y 2 + (α αt)T XT (Xαt y) continuation technique to update the value of λ so that it can ≈ 2k − k2 − − lead to the fastest convergence. The procedure regenerates the L t T t t + (α α ) (α α )= Pt(α, α ) value of λ using 2 − − (VI.15) λ = max γ XT y , λ (VI.18) where the solution can be reformulated as { k k∞ }

t+1 L t 2 where γ is a small constant. α = arg min α θ(α ) 2 + λ α 1 (VI.16) 2 k − k k k (2) Utilizing the BB spectral method to approximate Hf (α) t t 1 T t 1 where θ(α )= α L X (Xα y). ISTA employs τ I to approximate the matrix Hf (α), which Moreover, to accelerate− the convergence− of the algorithm, is the Hessian matrix of f(α) in problem VI.9 and FISTA FISTA also improves the sequence of iteration points, instead exploits the Lipschitz constant of f(α) to replace Hf (α). of employing the previous point it utilizes a specific linear However, SpaRSA utilizes the BB∇ spectral method to choose t t 1 combinations of the previous two points α , α − , i.e. the value of τ to mimic the Hessian matrix. The value of τ is { } t required to satisfy the condition: t t µ 1 t t 1 α = α + − (α α − ) (VI.17) µt+1 − 1 (αt+1 αt) f(αt+1) f(αt) (VI.19) where µt is a positive sequence, which satisfies µt (t+1)/2, τ t+1 − ≈ ∇ − ∇ ≥ and the main steps of FISTA are summarized in Algorithm 5. which satisfies the minimization problem The backtracking linear research strategy can also be utilized to explore a more feasible value of L and more detailed 1 1 = arg min (αt+1 αt) ( f(αt+1) f(αt)) 2 analyses on FISTA can be found in the literature [82, 83]. τ t+1 kτ − − ∇ − ∇ k2 (αt+1 αt)T ( f(αt+1) f(αt)) Algorithm 5. Fast Iterative shrinkage thresholding algorithm (FISTA) = − ∇ − ∇ αt+1 αt T αt+1 αt Task: To address the problem αˆ = argmin F (α) = 1 Xα y 2 + ( ) ( ) 2 k − k2 − − (VI.20) λ α 1 k k Input: Probe sample y, the measurement matrix X, small constant λ 0 T For problem VI.14, SpaRSA requires that the value of Initialization: t = 0, µ = 1, L = 2Λmax(X X), i.e. Lipschitz constant of f. λ is a decreasing sequence using the Eq. VI.18 and the ∇ While not converged do value of τ should meet the condition of Eq. VI.20. The Step 1: Exploit the shrinkage operator in equation VI.7 to solve problem VI.16. sparse reconstruction by separable approximation (SpaRSA) t 2 t+1 1+√1+4(µ ) is summarized in Algorithm 6 and more information can be Step 2: Update the value of µ using µ = 2 . Step 3: Update iteration sequence αt using equation VI.17. found in the literature [84]. End Output: α Algorithm 6. Sparse reconstruction by separable approximation (SpaRSA) Task: To address the problem 1 2 αˆ = argmin F (α)= Xα y + λ α 1 2 k − k2 k k D. Sparse reconstruction by separable approximation Input: Probe sample y, the measurement matrix X, small constant λ Initialization: , , y0 y, 1 α T , tolerance t = 0 i = 0 = τ 0 I Hf ( )= X X (SpaRSA) 5 ≈ ε = 10− . T t Sparse reconstruction by separable approximation (SpaRSA) Step 1: λt = max γ X y , λ . Step 2: Exploit shrinkage{ k operatork∞ to} solve problem VI.14, i.e. [84] is another typical proximity algorithm based on sparse i+1 i i T T t i α = shrink(α τ X (X α y), λtτ ). representation, which can be viewed as an accelerated version Step 3: Update the value of− 1 using the− Eq. VI.20. τ i+1 of ISTA. SpaRSA provides a general algorithmic framework αi+1 αi Step 4: If k − k ε, go to step 5; Otherwise, return to step 2 αi ≤ for solving the sparse representation problem and here a and i = i + 1. simple specific SpaRSA with adaptive continuation on ISTA Step 5: yt+1 = y Xαt+1. − is introduced. The main contributions of SpaRSA are trying Step 6: If λt = λ, stop; Otherwise, return to step 1 and t = t + 1. Output: αi to optimize the parameter λ in problem VI.8 by using the worm-starting technique, i.e. continuation, and choosing a JOURNAL 14

E. $l_{1/2}$-norm regularization based sparse representation

Sparse representation with the $l_p$-norm ($0 < p < 1$) has been studied in depth, with particular attention paid to sparse representation with the $l_{1/2}$-norm regularization [87]. Moreover, the authors of [87] have proposed some effective methods to solve the $l_{1/2}$-norm regularization problem [60, 88].

In this section, a half proximal algorithm is introduced to solve the $l_{1/2}$-norm regularization problem [60]; it parallels the iterative shrinkage thresholding algorithm for the $l_1$-norm regularization discussed above and the iterative hard thresholding algorithm for the $l_0$-norm regularization. Sparse representation with the $l_{1/2}$-norm regularization explicitly solves the problem

$$\hat\alpha = \arg\min_\alpha F(\alpha) = \|X\alpha - y\|_2^2 + \lambda\|\alpha\|_{1/2}^{1/2} \qquad (VI.21)$$

whose first-order optimality condition on $\alpha$ can be formulated as

$$\nabla F(\alpha) = X^T(X\alpha - y) + \frac{\lambda}{2}\nabla\big(\|\alpha\|_{1/2}^{1/2}\big) = 0 \qquad (VI.22)$$

which admits the following equation:

$$X^T(y - X\alpha) = \frac{\lambda}{2}\nabla\big(\|\alpha\|_{1/2}^{1/2}\big) \qquad (VI.23)$$

where $\nabla(\|\alpha\|_{1/2}^{1/2})$ denotes the gradient of the regularization term $\|\alpha\|_{1/2}^{1/2}$. Subsequently, an equivalent transformation of Eq. VI.23 is obtained by multiplying both sides by a positive constant $\tau$ and adding $\alpha$ to both sides. That is,

$$\alpha + \tau X^T(y - X\alpha) = \alpha + \tau\frac{\lambda}{2}\nabla\big(\|\alpha\|_{1/2}^{1/2}\big) \qquad (VI.24)$$

To this end, the resolvent operator [60] is introduced to compute the resolvent solution of the right part of Eq. VI.24, and the resolvent operator is defined as

$$R_{\lambda,\frac{1}{2}}(\bullet) = \Big(I + \frac{\lambda\tau}{2}\nabla\big(\|\bullet\|_{1/2}^{1/2}\big)\Big)^{-1} \qquad (VI.25)$$

which is very similar to the inverse of the right part of Eq. VI.24. The resolvent operator is well defined regardless of whether the resolvent solution of $\nabla(\|\bullet\|_{1/2}^{1/2})$ exists or not [60]. Applying the resolvent operator to problem VI.24,

$$\alpha = \Big(I + \frac{\lambda\tau}{2}\nabla\big(\|\bullet\|_{1/2}^{1/2}\big)\Big)^{-1}\big(\alpha + \tau X^T(y - X\alpha)\big) = R_{\lambda,\frac{1}{2}}\big(\alpha + \tau X^T(y - X\alpha)\big) \qquad (VI.26)$$

can be obtained, which is well-defined. Denoting $\theta(\alpha) = \alpha + \tau X^T(y - X\alpha)$, the resolvent operator can be explicitly expressed as

$$R_{\lambda,\frac{1}{2}}(x) = \big(f_{\lambda,\frac{1}{2}}(x_1), f_{\lambda,\frac{1}{2}}(x_2), \cdots, f_{\lambda,\frac{1}{2}}(x_N)\big)^T \qquad (VI.27)$$

where $f_{\lambda,\frac{1}{2}}(\cdot)$ is the half thresholding function of [60],

$$f_{\lambda\tau,\frac{1}{2}}(x_i) = \frac{2}{3}x_i\Big(1 + \cos\big(\tfrac{2\pi}{3} - \tfrac{2}{3}\varphi_{\lambda\tau}(x_i)\big)\Big), \quad \varphi_{\lambda\tau}(x_i) = \arccos\Big(\frac{\lambda\tau}{8}\big(\tfrac{|x_i|}{3}\big)^{-\frac{3}{2}}\Big) \qquad (VI.28)$$

so that the half proximal thresholding operator acts component-wise as

$$h_{\lambda\tau,\frac{1}{2}}(x_i) = \begin{cases} f_{\lambda\tau,\frac{1}{2}}(x_i), & |x_i| > \frac{\sqrt[3]{54}}{4}(\lambda\tau)^{\frac{2}{3}} \\ 0, & \text{otherwise} \end{cases} \qquad (VI.29)$$

where the threshold $\frac{\sqrt[3]{54}}{4}(\lambda\tau)^{\frac{2}{3}}$ has been conceived and demonstrated in the literature [60].

Therefore, if Eq. VI.29 is applied to Eq. VI.27, the half proximal thresholding function, instead of the resolvent operator, for the $l_{1/2}$-norm regularization problem VI.25 can be explicitly reformulated as

$$\alpha = H_{\lambda\tau,\frac{1}{2}}(\theta(\alpha)) \qquad (VI.30)$$

where the half proximal thresholding operator $H$ [60] is deductively constituted by Eq. VI.29.

Up to now, the half proximal thresholding algorithm has been completely structured by Eq. VI.30. However, the choice of the regularization parameter $\lambda$ in Eq. VI.24 can seriously dominate the quality of the representation solution of problem VI.21, and the values of $\lambda$ and $\tau$ can be specifically fixed by

$$\tau = \frac{1-\varepsilon}{\|X\|_2^2} \quad \text{and} \quad \lambda = \frac{\sqrt{96}}{9\tau}\,\big|[\theta(\alpha)]_{k+1}\big|^{\frac{3}{2}} \qquad (VI.31)$$

where $\varepsilon$ is a very small constant close to zero, $k$ denotes the limit of sparsity (i.e. $k$-sparsity), and $[\bullet]_k$ refers to the $k$-th largest component of $[\bullet]$. The half proximal thresholding algorithm for $l_{1/2}$-norm regularization based sparse representation is summarized in Algorithm 7, and more detailed inferences and analyses can be found in the literature [60, 88].

Algorithm 7. The half proximal thresholding algorithm for $l_{1/2}$-norm regularization
Task: To address the problem $\hat\alpha = \arg\min F(\alpha) = \|X\alpha - y\|_2^2 + \lambda\|\alpha\|_{1/2}^{1/2}$
Input: Probe sample $y$, the measurement matrix $X$
Initialization: $t = 0$, $\varepsilon = 0.01$, $\tau = \frac{1-\varepsilon}{\|X\|_2^2}$.
While not converged do
Step 1: Compute $\theta(\alpha^t) = \alpha^t + \tau X^T(y - X\alpha^t)$.
Step 2: Compute $\lambda_t = \frac{\sqrt{96}}{9\tau}\,|[\theta(\alpha^t)]_{k+1}|^{\frac{3}{2}}$ as in Eq. VI.31.
Step 3: Apply the half proximal thresholding operator to obtain the representation solution $\alpha^{t+1} = H_{\lambda_t\tau,\frac{1}{2}}(\theta(\alpha^t))$.
Step 4: $t = t + 1$.
End
Output: $\alpha$
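A minimal sketch of Algorithm 7 follows, assuming the half thresholding expression of Eqs. VI.28–VI.29 and the parameter rule of Eq. VI.31; the function names, the fixed iteration count, and the direct spectral-norm computation are illustrative choices rather than the reference implementation.

import numpy as np

def half_threshold(x, lam, tau):
    # Half proximal thresholding operator H_{lam*tau,1/2}, applied component-wise
    # (sketch of Eqs. VI.28-VI.29: zero out small entries, shrink the remaining ones).
    lt = lam * tau
    thresh = (54.0 ** (1.0 / 3.0) / 4.0) * lt ** (2.0 / 3.0)
    out = np.zeros_like(x)
    big = np.abs(x) > thresh
    phi = np.arccos((lt / 8.0) * (np.abs(x[big]) / 3.0) ** (-1.5))
    out[big] = (2.0 / 3.0) * x[big] * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
    return out

def half_proximal(X, y, k, n_iter=100, eps=0.01):
    # Sketch of Algorithm 7: gradient-like step theta(alpha), lambda_t from Eq. VI.31, then threshold.
    tau = (1.0 - eps) / np.linalg.norm(X, 2) ** 2
    a = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta = a + tau * (X.T @ (y - X @ a))
        lam_t = (np.sqrt(96.0) / (9.0 * tau)) * np.sort(np.abs(theta))[::-1][k] ** 1.5
        a = half_threshold(theta, lam_t, tau)
    return a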
F. Augmented Lagrange Multiplier based optimization strategy

The Lagrange multiplier is a widely used tool for eliminating an equality constraint: it converts the constrained problem into an unconstrained one with an appropriate penalty function.
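As a preview of the alternating updates derived in the remainder of this subsection (Eqs. VI.32–VI.35), the following is a minimal sketch of a primal augmented Lagrangian (PALM-style) iteration for the equality-constrained l1 problem. The inner l1-regularized subproblem is solved here with a few shrinkage steps as a stand-in for FISTA, and the names, iteration counts, and step-size choice are assumptions rather than the authors' implementation.

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def palm_l1(X, y, lam=1.0, n_outer=50, n_inner=20):
    # Sketch of min ||a||_1 s.t. y = X a via the augmented Lagrangian
    # L(a, z) = ||a||_1 + (lam/2) * ||y - X a||^2 + z^T (y - X a).
    m, n = X.shape
    a = np.zeros(n)
    z = np.zeros(m)
    L = lam * np.linalg.norm(X, 2) ** 2            # Lipschitz bound for the smooth part
    for _ in range(n_outer):
        # Inner step (cf. Eq. VI.34): approximately minimize the augmented Lagrangian over a.
        for _ in range(n_inner):
            grad = -lam * (X.T @ (y - X @ a)) - X.T @ z
            a = soft_threshold(a - grad / L, 1.0 / L)
        # Multiplier update (cf. Eq. VI.35).
        z = z + lam * (y - X @ a)
    return a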

Specifically, the sparse representation problem III.9 can be Second, consider the Lagrange dual problem of problem viewed as an equality constrained problem and the equivalent III.9 and its dual function is problem III.12 is an unconstrained problem, which augments T T T g(λ) = inf α 1+λ (y Xα) = λ y sup λ Xα α 1 the objective function of problem III.9 with a weighted con- α {k k − } − α { −k k } straint function. In this section, the augmented Lagrangian (VI.39) method (ALM) is introduced to solve the sparse representation where λ Rd is a Lagrangian multiplier. If the definition of ∈ problem III.9. conjugate function is applied to Eq. VI.37, it can be verified 1 First, the augmented Lagrangian function of problem III.9 that the conjugate function of IB∞ (θ) is α 1. Thus Eq. VI.39 k k is conceived by introducing an additional equality constrained can be equivalently reformulated as function, which is enforced on the Lagrange function in T T g(λ)= λ y I 1 (X λ) (VI.40) problem III.12. That is, − B∞ The Lagrange dual problem, which is associated with the primal problem III.9, is an optimization problem: λ L(α, λ)= α + y Xα 2 s.t. y Xα =0 (VI.32) k k1 2 k − k2 − max λT y s.t. (XT λ) B1 (VI.41) Then, a new optimization problem VI.32 with the form of the λ ∈ ∞ Lagrangain function is reformulated as Accordingly, T T 1 λ 2 T min λ y s.t. z X λ =0, z B (VI.42) arg min L (α, z)= α + y Xα + z (y Xα) λ,z − − ∈ ∞ λ k k1 2 k − k2 − (VI.33) Then, the optimization problem VI.42 can be reconstructed as where z Rd is called the Lagrange multiplier vector or dual ∈ arg min L(λ, z, µ)= λT y µT (z XT λ) variable and Lλ(α, z) is denoted as the augmented Lagrangian λ,z,µ − − − (VI.43) function of problem III.9. The optimization problem VI.33 τ T 2 1 + z X λ 2 s.t. z B is a joint optimization problem of the sparse representation 2 k − k ∈ ∞ coefficient α and the Lagrange multiplier vector z. Problem where µ Rd is a Lagrangian multiplier and τ is a penalty VI.33 is solved by optimizing α and z alternatively as follows: parameter.∈ t+1 t Finally, the dual optimization problem VI.43 is solved and α = arg minLλ(α, z ) a similar alternating minimization idea of PALM can also be λ = arg min( α + y Xα 2 + (zt)T Xα) applied to problem VI.43, that is, k k1 2 k − k2 (VI.34) t+1 t t z = arg min Lτ (λ , z, µ ) z B1 t+1 t t+1 ∞ z = z + λ(y Xα ) (VI.35) ∈ τ − = arg min µT (z XT λt)+ z XT λt 2 1 2 z B∞{− − 2 k − k } where problem VI.34 can be solved by exploiting the FISTA ∈ τ 2 algorithm. Problem VI.34 is iteratively solved and the parame- = arg min z (XT λt + µT ) 2 1 2 z B∞{ 2 k − τ k } ter z is updated using Eq. VI.35 until the termination condition ∈ is satisfied. Furthermore, if the method of employing ALM T t 1 T = PB1 (X λ + µ ) to solve problem VI.33 is denoted as the primal augmented ∞ τ (VI.44) Lagrangian method (PALM) [89], the dual function of problem

III.9 can also be addressed by the ALM algorithm, which is 1 where PB∞ (u) is a projection, or called a proximal operator, denoted as the dual augmented Lagrangian method (DALM) onto B1 and it is also called group-wise soft-thresholding. ∞ [89]. Subsequently, the dual optimization problem III.9 is 1 For example, let x = PB∞ (u), then the i-th component of discussed and the ALM algorithm is utilized to solve it. solution x satisfies x = sign(u ) min u , 1 i i {| i| } First, consider the following equation: t+1 t+1 t λ = arg min Lτ (λ, z , µ ) λ α 1 = max θ, α (VI.36) T t T T τ t+1 T 2 k k θ ∞ 1h i = arg min λ y + (µ ) X λ + z X λ 2 k k ≤ λ {− 2 k − k } which can be rewritten as = Q(λ) (VI.45) α 1 = max θ, α I 1 k k {h i− B∞ } (VI.37) or α = sup θ, α I 1 Take the derivative of Q(λ) with respect to λ and obtain k k1 {h i− B∞ } t+1 T 1 t+1 t λ N λ = (τXX )− (τXz + y Xµ ) (VI.46) where B = x R x p λ and IΩ(x) is a indicator − p { ∈ | k k ≤ } 0 , x Ω µt+1 µt zt+1 T λt+1 (VI.47) function, which is defined as I (x)= . = τ( X ) Ω , x ∈ Ω − − Hence,  ∞ 6∈ The DALM for sparse representation with l1-norm regu- larization mainly exploits the augmented Lagrange method 1 α 1 = max θ, α : θ B (VI.38) to address the dual optimization problem of problem III.9 k k {h i ∈ ∞} JOURNAL 16 and a proximal operator, the projection operator, is utilized the original optimization problems by tracing a continuous to efficiently solve the subproblem. The algorithm of DALM parameterized path of solutions along with varying parameters. is summarized in Algorithm 8. For more detailed description, Having a highly intimate relationship with the conventional please refer to the literature [89]. sparse representation method such as least angle regression (LAR) [42], OMP [64] and polytope faces pursuit (PFP) Algorithm 8. Dual augmented Lagrangian method for l1-norm regulariza- [95], the homotopy algorithm has been successfully employed tion to solve the l1-norm minimization problems. In contrast to Task: To address the dual problem of αˆ = argminα α 1 s.t. y = Xα k k LAR and OMP, the homotopy method is more favorable for sequentially updating the sparse solution by adding or Input: Probe sample y, the measurement matrix X, a small constant λ0. Initialization: , , 1 ε , 0 . removing elements from the active set. Some representative t = 0 ε = 0.01 τ = X− 2 µ = 0 While not converged do k k methods that exploit the homotopy-based strategy to solve the Step 1: Apply the projection operator to compute sparse representation problem with the l1-norm regularization t+1 T t 1 T z = P 1 (X λ + µ ). B∞ τ are explicitly presented in the following parts of this section. t+1 T 1 t+1 t Step 2: Update the value of λ = (τXX )− (τXz +y Xµ ). Step 3: Update the value of µt+1 = µt τ(zt+1 XT λt+1−). Step 4: t = t + 1. − − A. LASSO homotopy End While Output: α = µ[1 : N] Because of the significance of parameters in l1-norm min- imization, the well-known LASSO homotopy algorithm is proposed to solve the LASSO problem in III.9 by tracing the whole homotopy solution path in a range of decreasing G. Other proximity algorithm based optimization methods values of parameter λ. It is demonstrated that problem III.12 The theoretical basis of the proximity algorithm is to first with an appropriate parameter value is equivalent to problem construct a proximal operator, and then utilize the proximal III.9 [29]. Moreover, it is apparent that as we change λ from operator to solve the convex optimization problem. 
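As a concrete reading of the DALM iteration of Algorithm 8 above, the following sketch implements its three updates: the projection onto the l-infinity ball (component-wise clipping), the lambda update by a linear solve with tau*X*X^T (Eq. VI.46), and the multiplier update (Eq. VI.47). The penalty parameter, iteration count, and the assumption that X has full row rank are heuristic choices for illustration only.

import numpy as np

def project_linf_ball(u):
    # Projection onto the l_infinity unit ball: clip each component to [-1, 1],
    # i.e. x_i = sign(u_i) * min(|u_i|, 1).
    return np.clip(u, -1.0, 1.0)

def dalm_l1(X, y, tau=0.1, n_iter=100):
    # Sketch of the dual augmented Lagrangian iteration of Algorithm 8.
    m, n = X.shape
    lam = np.zeros(m)       # dual variable lambda
    mu = np.zeros(n)        # Lagrange multiplier; Algorithm 8 returns it as the sparse code
    XXt = X @ X.T           # assumed invertible (X of full row rank)
    for _ in range(n_iter):
        z = project_linf_ball(X.T @ lam + mu / tau)                      # Eq. VI.44
        lam = np.linalg.solve(tau * XXt, tau * (X @ z) + y - X @ mu)     # Eq. VI.46
        mu = mu - tau * (z - X.T @ lam)                                  # Eq. VI.47
    return mu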
Massive a very large value to zero, the solution of problem III.12 is proximity algorithms have followed up with improved tech- converging to the solution of problem III.9 [29]. The set of niques to improve the effectiveness and efficiency of proximity varying value λ conceives the solution path and any point on algorithm based optimization methods. For example, Elad et the solution path is the optimality condition of problem III.12. al. proposed an iterative method named parallel coordinate More specifically, the LASSO homotopy algorithm starts at descent algorithm (PCDA) [90] by introducing the element- an large initial value of parameter λ and terminates at a wise optimization algorithm to solve the regularized linear point of λ, which approximates zero, along the homotopy with non-quadratic regularization problem. solution path so that the optimal solution converges to the Inspired by belief propagation in graphical models, Donoho solution of problem III.9. The fundamental of the homotopy et al. developed a modified version of the iterative thresholding algorithm is that the homotopy solution path is a piecewise method, called approximate message passing (AMP) method linear path with a discrete number of operations while the [91], to satisfy the requirement that the sparsity undersampling value of the homotopy parameter changes, and the direction tradeoff of the new algorithm is equivalent to the correspond- of each segment and the step size are absolutely determined ing convex optimization approach. Based on the develop- by the sign sequence and the support of the solution on the ment of the first-order method called Nesterov’s smoothing corresponding segment, respectively [96]. framework in convex optimization, Becker et al. proposed a Based on the basic ideas in a convex optimization problem, generalized Nesterov’s algorithm (NESTA) [92] by employing it is a necessary condition that the zero vector should be a the continuation-like scheme to accelerate the efficiency and solution of the subgradient of the objective function of problem flexibility. Subsequently, Becker et al. [93] further constructed III.12. Thus, we can obtain the subgradiential of the objective a general framework, i.e. templates for convex cone solvers function with respect to α for any given value of λ, that is, (TFOCS), for solving massive certain types of compressed sensing reconstruction problems by employing the optimal ∂L T = X (y Xα)+ λ∂ α 1 (VII.1) first-order method to solve the smoothed dual problem of ∂α − − k k the equivalent conic formulation of the original optimization where the first term r = XT (y Xα) is called the vector of − problem. Further detailed analyses and inference information residual correlations, and ∂ α 1 is the subgradient obtained related to proximity algorithms can be found in the literature by k k [28, 83]. N θi = sgn(αi), αi =0 ∂ α 1 = θ R 6 k k ∈ θi [ 1, 1], αi =0  ∈ −  VII. HOMOTOPYALGORITHMBASEDSPARSE Let Λ and u denote the support of α and the sign sequence

REPRESENTATION of α on its support Λ, respectively. XΛ denotes that the indices of all the samples in X are all included in the support set The concept of homotopy derives from topology and the Λ Λ. If we analyze the KKT optimality condition for problem homotopy technique is mainly applied to address a nonlinear III.12, we can obtain the following two equivalent conditions system of equations problem. The homotopy method was of problem VII.1, i.e. originally proposed to solve the least square problem with T the l1-penalty [94]. The main idea of homotopy is to solve XΛ(y Xα)= λu; XΛc (y Xα) λ (VII.2) − k − k∞ ≤ JOURNAL 17 where Λc denotes the complementary set of the set Λ. Thus, condition p =0 is satisfied so that the solution of problem the optimality conditions in VII.2 can be divided into N III.9 is reached.k k∞ The principal steps of the LASSO homotopy constraints and the homotopy algorithm maintains both of the algorithm have been summarized in Algorithm 9. For further conditions along the optimal homotopy solution path for any description and analyses, please refer to the literature [29, 96]. λ 0. As we decrease the value of λ to λ τ, for a small value≥ of τ, the following conditions should be− satisfied Algorithm 9. Lasso homotopy algorithm Task: To addrss the Lasso problem: T T 2 X (y Xα)+ τX Xδ = (λ τ)u (a) αˆ = argminα y Xα s.t. α 1 ε Λ − Λ − (VII.3) k − k2 k k ≤ p + τq λ τ (b) Input: Probe sample y, measurement matrix X. ∞ k k ≤ − Initialization: l = 1, initial solution α and its support set Λ . where p = XT (y Xα), q = XT Xδ and δ is the update l l − Repeat: direction. Step 1: Compute update direction δl by using Eq. (VII.5). + Generally, the homotopy algorithm is implemented itera- Step 2: Compute τl and τl− by using Eq. (VII.6) and Eq. (VII.7). Step 3: Compute the optimal minimum step size τl∗ by using tively and it follows the homotopy solution path by updating + τ = min τ , τ − . the support set by decreasing parameter λ from a large value l∗ { l l } Step 4: Update the solution αl+1 by using αl+1 = αl + τl∗δl. to the desired value. The support set of the solution will be Step 5: Update the support set: + updated and changed only at a critical point of λ, where If τl == τl− then either an existing nonzero element shrinks to zero or a new Remove the i− from the support set, i.e. Λl+1 =Λl i−. else \ nonzero element will be added into the support set. The + + Add the i into the support set, i.e. Λl+1 =Λl i two most important parameters are the step size τ and the End if S T 1 Step 6: l = l + 1. update direction δ. At the l-th stage (if (X XΛ)− exists), Λ Until XT (y Xα) = 0 the homotopy algorithm first calculates the update direction, k − k∞ Output: αl+1 which can be obtained by solving T XΛ XΛδl = u (VII.4) Thus, the solution of problem VII.4 can be written as B. BPDN homotopy Problem III.11, which is called basis pursuit denoising (XT X ) 1u, on Λ δ = Λ Λ − (VII.5) (BPDN) in signal processing, is the unconstrained Lagrangian l 0, otherwise  function of the LASSO problem III.9, which is an uncon- Subsequently, the homotopy algorithm computes the step strained problem. The BPDN homotopy algorithm is very size τ to the next critical point by tracing the homotopy similar to the LASSO homotopy algorithm. If we consider the solution path. i.e. the homotopy algorithm moves along the KKT optimality condition for problem III.12, the following update direction until one of constraints in VII.3 is not condition should be satisfied for the solution α satisfied. 
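To make one step of the LASSO homotopy concrete, the sketch below computes the update direction of Eq. VII.5 and the two candidate step sizes used in Algorithm 9 above (their explicit expressions appear as Eqs. VII.6–VII.7 below). The support and sign bookkeeping of Steps 5–6 of Algorithm 9 is omitted, a non-empty current support with an invertible Gram matrix is assumed, and all names are illustrative.

import numpy as np

def lasso_homotopy_step(X, y, alpha, support, lam):
    # One homotopy step: direction on the support (Eq. VII.5), then the smallest
    # positive step that either activates an inactive index or zeroes an active one.
    u = np.sign(alpha[support])
    XS = X[:, support]
    delta = np.zeros(X.shape[1])
    delta[support] = np.linalg.solve(XS.T @ XS, u)
    p = X.T @ (y - X @ alpha)                     # residual correlations
    q = X.T @ (X @ delta)
    inactive = np.setdiff1d(np.arange(X.shape[1]), support)
    # Candidate step that activates an inactive index (cf. Eq. VII.6).
    cand = np.concatenate([(lam - p[inactive]) / (1.0 - q[inactive]),
                           (lam + p[inactive]) / (1.0 + q[inactive])])
    tau_plus = cand[cand > 0].min() if np.any(cand > 0) else np.inf
    # Candidate step that shrinks an active coefficient to zero (cf. Eq. VII.7).
    ratios = -alpha[support] / delta[support]
    tau_minus = ratios[ratios > 0].min() if np.any(ratios > 0) else np.inf
    tau = min(tau_plus, tau_minus)
    return alpha + tau * delta, lam - tau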
At this critical point, a new nonzero element must XT (y Xα) λ (VII.8) enter the support Λ, or one of the nonzero elements in α k − k∞ ≤ will be shrink to zero, i.e. this element must be removed As for any given value of λ and the support set Λ, the from the support set Λ. Two typical cases may lead to a new following two conditions also need to be satisfied critical point, where either condition of VII.3 is violated. The minimum step size which leads to a critical point can be easily T T + + XΛ (y Xα)= λu; XΛc (y Xα) λ (VII.9) obtained by computing τl∗ = min(τl , τl−), and τl and τl− − k − k∞ ≤ are computed by The BPDN homotopy algorithm directly computes the ho- motopy solution by + λ pi λ + pi τ = mini Λc − , (VII.6) l ∈ T T 1 xi XΛδl 1+ xi XΛδl +  −  T 1 T (X X )− (X y λu), on Λ i α = Λ Λ Λ − (VII.10) αl 0, otherwise τ − = mini Λ − (VII.7) l ∈ δi   l + which is somewhat similar to the soft-thresholding operator. T i where pi = xi (y xiαl) and min( )+ denotes that the The value of the homotopy parameter λ is initialized with a − · + T minimum is operated over only positive arguments. τl is the large value, which satisfies λ0 > X y . As the value of minimum step size that turns an inactive element at the index the homotopy parameter λ decreases,k thek∞ BPDN homotopy + + T 1 i in to an active element, i.e. the index i should be added algorithm traces the solution in the direction of (XΛ XΛ)− u into the support set. τl− is the minimum step size that shrinks till the critical point is obtained. Each critical point is reached the value of a nonzero active element to zero at the index i− when either an inactive element is transferred into an active and the index i− should be removed from the support set. The element, i.e. its corresponding index should be added into solution is updated by αl+1 = αl + τl∗δ, and its support and the support set, or an nonzero active element value in α sign sequence are renewed correspondingly. shrinks to zero, i.e. its corresponding index should be removed The homotopy algorithm iteratively computes the step size from the support set. Thus, at each critical point, only one and the update direction, and updates the homotopy solution element is updated, i.e. one element being either removed and its corresponding support and sign sequence till the from or added into the active set, and each operation is very JOURNAL 18 computationally efficient. The algorithm is terminated when homotopy parameter by tracing the homotopy solution path. the value of the homotopy parameter is lower than its desired Similar to the LASSO homotopy algorithm, problem VII.14 value. The BPDN homotopy algorithm has been summarized is also piecewise linear along the homotopy path, and for any in Algorithm 10. For further detail description and analyses, value of σ, the following conditions should be satisfied please refer to the literature [42]. T xi (Xα y)= ((1 σ)wi + σwˆi)ui for i Λ (a) xT (Xα− y) <−(1 −σ)w + σwˆ for i ∈Λc (b) Algorithm 10. BPDN homotopy algorithm | i − | − i i ∈ Task: To address the Lasso problem: (VII.15) 2 αˆ = argminα y Xα + λ α 1 where xi is the i-th column of the measurement X, wi and k − k2 k k Input: Probe sample y, measurement matrix X. wˆi are the given weight and new obtained weight, respectively. Initialization: l = 0, initial solution α0 and its support set Λ0, a large Moreover, for the optimal step size σ, when the homotopy value λ0, step size τ, tolerance ε. 
parameter changes from σ to σ + τ in the update direction δ, Repeat: the following optimality conditions also should be satisfied Step 1: Compute update direction δl+1 by using T 1 δl+1 = (XΛ XΛ)− ul . Step 2: Update the solution αl+1 by using Eq. (VII.10). T α y T δ Step 3: Update the support set and the sign sequence set. XΛ (X )+ τXΛ X = − ˆ ˆ Step 6: λl+1 = λl τ, l = l + 1. ((1 σ)W + σW )u + τ(W W )u (a) (VII.16) Until λ ε − − − − ≤ p τq r + τs (b) Output: αl+1 | − |≤ where u is the sign sequence of α on its support Λ, pi = T T xi (Xα y), qi = xi Xδ, ri = (1 σ)wi + σwˆi and si = wˆ w .− Thus, at the l-th stage (if−(XT X ) 1 exists), the C. Iterative Reweighting -norm minimization via homotopy i i i i − l1 update− direction of the homotopy algorithm can be computed Based on the homotopy algorithm, Asif and Romberg [96] by presented a enhanced sparse representation objective function, T 1 (X X )− (W Wˆ )u, on Λ a weighted l1-norm minimization, and then provided two fast δ = Λ Λ (VII.17) l 0,− otherwise and accurate solutions, i.e. the iterative reweighting algorithm,  which updated the weights with a new ones, and the adaptive The step size which can lead to a critical point can be com- reweighting algorithm, which adaptively selected the weights + + puted by τl∗ = min(τl , τl−), and τl and τl− are computed in each iteration. Here the iterative reweighting algorithm by via homotopy is introduced. The objective function of the + ri pi ri pi τl = mini Λc − , − − (VII.18) weighted l1-norm minimization is formulated as ∈ q s q + s  i − i i i + 1 i 2 αl argmin Xα y 2 + W α 1 (VII.11) τ − = mini Λ − (VII.19) 2k − k k k l ∈ δi  l + where W = diag[w1, w2, , wN ] is the weight of the ··· where τ + is the minimum step size so that the index i+ l1-norm and also is a diagonal matrix. For more explicit l description, problem VII.11 can be rewritten as should be added into the support set and τl− is the minimum step size that shrinks the value of a nonzero active element to N zero at the index i−. The solution and homotopy parameter 1 2 argmin Xα y + wi αi (VII.12) are updated by α = α + τ δ, and σ = σ + τ , 2k − k2 | | l+1 l l∗ l+1 l l∗ i=1 X respectively. The homotopy algorithm updates its support set A common method [42, 73] to update the weight W is and sign sequence accordingly until the new critical point achieved by exploiting the solution of problem VII.12, i.e. α, of the homotopy parameter σl+1 = 1. The main steps of at the previous iteration, and for the i-th element of the weight this algorithm are summarized in Algorithm 11 and more wi is updated by information can be found in literature [96]. λ w = (VII.13) i α + σ | i| D. Other homotopy algorithms for sparse representation where parameters λ and σ are both small constants. In order to The general principle of the homotopy method is to reach the efficiently update the solution of problem (7-9), the homotopy optimal solution along with the homotopy solution path by algorithm introduces a new weight of the l1-norm and a evolving the homotopy parameter from a known initial value new homotopy based reweighting minimization problem is to the final expected value. There are extensive hotomopy reformulated as algorithms, which are related to the sparse representation with N the l1-norm regularization. Malioutov et al. 
first exploited the 1 2 argmin Xα y 2 + ((1 σ)ˆwi + σwˆi) αi (VII.14) homotopy method to choose a suitable parameter for -norm 2k − k − | | l1 i=1 X regularization with a noisy term in an underdetermined system where wˆi denotes the new obtained weight by the homotopy and employed the homotopy continuation-based method to algorithm, parameter τ is denoted as the homotopy parameter solve BPDN for sparse signal processing [97]. Garrigues and varying from 0 to 1. Apparently, problem VII.14 can be Ghaoui [98] proposed a modified homotopy algorithm to solve evolved to problem VII.12 with the increasing value of the the Lasso problem with online observations by optimizing the JOURNAL 19

Algorithm 11. Iterative reweighting homotopy algorithm for weighted l1- over-complete dictionary that can lead to sparse representation norm minimization is usually achieved by exploiting pre-specified set of trans- Task: To addrss the weighted l1-norm minimization: 1 2 formation functions, i.e. transform domain method [5], or is αˆ = arg min 2 Xα y 2 + W α 1 k − k k k devised based on learning, i.e. dictionary learning methods Input: Probe sample y, measurement matrix X. Initialization: l = 1, initial solution αl and its support set Λl, σ1 = 0. [104]. Both of the transform domain and dictionary learning Repeat: based methods transform image samples into other domains Step 1: Compute update direction δl by using Eq. (VII.17). and the similarity of transformation coefficients are exploited Step 2: Compute p, q, r and s by using Eq. (VII.16). + [105]. The difference between them is that the transform Step 2: Compute τ and τ − by using Eq. (VII.18) and Eq. (VII.19). l l domain methods usually utilize a group of fixed transfor- Step 3: Compute the step size τl∗ by using + τl∗ = min τl , τl− . mation functions to represent the image samples, whereas Step 4: Update the{ solution α} by using α α δ . l+1 l+1 = l + τl∗ l the dictionary learning methods apply sparse representations Step 5: Update the support set: + on a over-complete dictionary with redundant information. If τl == τl− then Shrink the value to zero at the index i− and remove i−, Moreover, exploiting the pre-specified transform matrix in i.e. Λl+1 =Λl i−. transform domain methods is attractive because of its fast and else \ + + simplicity. Specifically, the transform domain methods usually Add the i into the support set, i.e. Λl+1 =Λl i End if S represent the image patches by using the orthonormal basis Step 6: σl+1 = σl + τl and l = l + 1. such as over-complete wavelets transform [106], super-wavelet Until σl+1 = 1 transform [107], bandelets [108], curvelets transform [109], Output: α l+1 contourlets transform [110] and steerable wavelet filters [111]. However, the dictionary learning methods exploiting sparse representation have the potential capabilities of outperforming homotopy parameter from the current solution to the solution the pre-determined dictionaries based on transformation func- after obtaining the next new data point. Efron et al. [42] pro- tions. Thus, in this subsection we only focus on the modern posed a basic pursuit denoising (BPDN) homotopy algorithm, over-complete dictionary learning methods. which shrinked the parameter to a final value with series An effective dictionary can lead to excellent reconstruction of efficient optimization steps. Similar to BPDN homotopy, results and satisfactory applications, and the choice of dictio- Asif [99] presented a homotopy algorithm for the Dantzing nary is also significant to the success of sparse representation selector (DS) under the consideration of primal and dual technique. Different tasks have different dictionary learning solution. Asif and Romberg [100] proposed a framework of rules. For example, image classification requires that the dynamic updating solutions for solving l -norm minimization 1 dictionary contains discriminative information such that the programs based on homotopy algorithm and demonstrated its solution of sparse representation possesses the capability of effectiveness in addressing the decoding issue. More recent distinctiveness. 
The purpose of dictionary learning is moti- literature related to homotopy algorithms can be found in the vated from sparse representation and aims to learn a faithful streaming recovery framework [101] and a summary [102]. and effective dictionary to largely approximate or simulate the specific data. In this section, some parameters are defined as HE APPLICATIONS OF THE SPARSE T VIII. T matrix Y = [y1, y2, , yN ], matrix X = [x1, x2, , xt] , REPRESENTATION METHOD and dictionary D =··· [d , d , , d ]. ··· 1 2 ··· M Sparse representation technique has been successfully im- From the notations of the literature [22, 112], the framework plemented to numerous applications, especially in the fields of dictionary learning can be generally formulated as an of computer vision, image processing, pattern recognition optimization problem and . More specifically, sparse representation N has also been successfully applied to extensive real-world 1 1 2 arg min ( yi Dxi 2 + λP (xi)) (VIII.1) D Ω,xi N 2k − k applications, such as image denoising, deblurring, inpainting, ∈ ( i=1 ) super-resolution, restoration, quality assessment, classification, X T segmentation, signal processing, object tracking, texture clas- where Ω = D = [d1, d2, , dM ]: di di = 1,i = 1, 2, ,M ({M here may not··· be equal to N), N denotes sification, image retrieval, bioinformatics, biometrics and other ··· } artificial intelligence systems. Moreover, dictionary learning is the number of the known data set (eg. training samples in one of the most typical representative examples of sparse rep- image classification), yi is the i-th sample vector from a resentation for realizing the sparse representation of a signal. known set, D is the learned dictionary and xi is the sparsity In this paper, we only concentrate on the three applications of vector. P (xi) and λ are the penalty or regularization term sparse representation, i.e. sparse representation in dictionary and a tuning parameter, respectively. The regularization term learning, image processing, image classification and visual of problem VIII.1 controls the degree of sparsity. That is, tracking. different kinds of the regularization terms can immensely dominate the dictionary learning results. One spontaneous idea of defining the penalty term P (xi) A. Sparse representation in dictionary learning is to introduce the l0-norm regularization, which leads to The history of modeling dictionary could be traced back to the sparsest solution of problem VIII.1. As a result, the 1960s, such as the fast Fourier transform (FFT) [103]. An theory of sparse representation can be applied to dictionary JOURNAL 20 learning. The most representative dictionary learning based such as hierarchical sparse dictionary learning [123] and group on the l0-norm penalty is the K-SVD algorithm [8], which or block sparse dictionary learning [124]. Recently, some is widely used in image denoising. Because the solution of researchers [22] categorized the latest methods of dictionary l0-norm regularization is usually a NP-hard problem, utilizing learning into four groups, online dictionary learning [125], a convex relaxation strategy to replace l0-norm regularization joint dictionary learning [126], discriminative dictionary learn- is an advisable choice for dictionary learning. As a convex ing [127], and supervised dictionary learning [128]. 
relaxation method of l0-norm regularization, the l1-norm reg- Although there are extensive strategies to divide the avail- ularization based dictionary learning has been proposed in able sparse representation based dictionary learning methods large numbers of dictionary learning schemes. In the stage into different categories, the strategy used here is to categorize of convex relaxation methods, there are three optimal forms the current prevailing dictionary learning approaches into two for updating a dictionary: the one by one atom updating main classes: supervised dictionary learning and unsupervised method, group atoms updating method, and all atoms updating dictionary learning, and then specific representative algorithms method [112]. Furthermore, because of over-penalization in l1- are explicitly introduced. norm regularization, non-convex relaxation strategies also have 1) Unsupervised dictionary learning: From the viewpoint been employed to address dictionary learning problems. For of theoretical basis, the main difference of unsupervised and example, Fan and Li proposed a smoothly clipped absolution supervised dictionary learning relies on whether the class deviation (SCAD) penalty [113], which employed an iterative label is exploited in the process of learning for obtaining approximate Newton-Raphson method for penalizing least the dictionary. Unsupervised dictionary learning methods have sequences and exploited the penalized likelihood approaches been widely implemented to solve image processing problems, for variable selection in linear regression models. Zhang such as image compression, and feature coding of image introduced and studied the non-convex minimax concave (MC) representation [129, 130]. family [114] of non-convex piecewise quadratic penalties to (1) KSVD for unsupervised dictionary learning make unbiased variable selection for the estimation of regres- One of the most representative unsupervised dictionary sion coefficients, which was demonstrated its effectiveness learning algorithms is the KSVD method [122], which is a by employing an oracle inequality. Friedman proposed to modification or an extension of method of directions (MOD) use the logarithmic penalty for a model selection [115] and algorithm. The objective function of KSVD is used it to solve the minimization problems with non-convex 2 arg min Y DX F s.t. xi 0 k, i =1, 2, ,N regularization terms. From the viewpoint of updating strategy, D,X{k − k } k k ≤ ··· most of the dictionary learning methods always iteratively (VIII.2) Rd N update the sparse approximation or representation solution where Y × is the matrix composed of all the known ∈ d N N N examples, D R × is the learned dictionary, X R × and the dictionary alternatively, and more dictionary learning ∈ ∈ theoretical results and analyses can be found in the literature is the matrix of coefficients, k is the limit of sparsity and xi [116, 117]. denotes the i-th row vector of the matrix X. Problem VIII.2 Recently, varieties of dictionary learning methods have been is a joint optimization problem with respect to D and X, and proposed and researchers have attempted to exploit different the natural method is to alternatively optimize the D and X strategies for implementing dictionary learning tasks based on iteratively. sparse representation. There are several means to categorize Algorithm 12. The K-SVD algorithm for dictionary learning these dictionary learning algorithms into various groups. For 2 Task: Learning a dictionary D: arg minD,X Y DX F s.t. 
xi 0 example, dictionary learning methods can be divided into k, i = 1, 2, ,N k − k k k ≤ three groups in the context of different norms utilized in the ··· Input: The matrix composed of given samples Y =[y1, y2, , ym]. ··· penalty term, that is, l0-norm regularization based methods, Initialization: Set the initial dictionary D to the l2–norm unit matrix, convex relaxation methods and non-convex relaxation methods i = 1. [118]. Moreover, dictionary learning algorithms can also be While not converged do Step 1: For each given example yi, employing the classical sparse divided into three other categories in the presence of different representation with l0-norm regularization to solve problem VIII.3 structures. The first category is dictionary learning under the for further estimating Xi,set l = 1. probabilistic framework such as maximum likelihood methods While l is not equal to k do Step 2: Compute the overall representation residual [119], the method of optimal directions (MOD) [120], and the T El = Y j=l dj xj . − P 6 maximum a posteriori probability method [121]. The second Step 3: Extract the column items of El which corresponds T P category is clustering based dictionary learning approaches to the nonzero elements of xl and obtain El . P P T such as KSVD [122], which can be viewed as a generalization Step 4: SVD decomposes El into El = UΛV . Step 5: Update dl to the first column of U and update of K-means. The third category is dictionary learning with T corresponding coefficients in xl by Λ(1, 1) times the certain structures, which are grouped into two significative first column of V . aspects, i.e. directly modeling the relationship between each Step 6: l = l + 1. atom and structuring the corrections between each atom with End While Step 7: i = i + 1. purposive sparsity penalty functions. There are two typical End While models for these kinds of dictionary learning algorithms, Output: dictionary D sparse and shift-invariant representation of dictionary learning and structure sparse regularization based dictionary learning, More specifically, when fixing dictionary D, problem VIII.2 JOURNAL 21 is converted to where µ is a small constant as a regularization parameter for adjusting the weighting decay speed, is the operator of the ⊙ RN 1 2 element-wise multiplication, xi is the code for yi, 1 × arg min Y DX F s.t. xi 0 k, i =1, 2, ,N ∈ X k − k k k ≤ ··· is defined as a vector with all elements as 1 and vector b is (VIII.3) the locality adaptor, which is, more specifically, set as which is called sparse coding and k is the limit of sparsity. Then, its subproblem is considered as follows: dist(y ,D) b = exp i (VIII.7) 2 σ arg min yi Dxi 2 s.t. xi 0 k, i =1, 2, ,N   xi k − k k k ≤ ··· where dist(y ,D) = [dist(y , d ), ,dist(y , d )] and where we can iteratively resort to the classical sparse repre- i i 1 ··· i N dist(yi, dj ) denotes the distance between yi and dj with sentation with l0 -norm regularization such as MP and OMP, different distance metrics, such as Euclidean distance and for estimating x . i Chebyshev distance. Specifically, the i-th value of vector b When fixing X, problem VIII.3 becomes a simple regres- is defined as b = exp dist(yi,di) . sion model for obtaining D, that is i σ The K-Means clustering algorithm is applied to generate the ˆ 2 D = arg min Y DX F (VIII.4) codebook D, and then the solution of LLC can be deduced D k − k as: T T 1 where Dˆ = YX† = YX (XX )− and the method is called 2 xˆi = (Ci + µ diag (b)) 1 (VIII.8) MOD. 
Considering that the computational complexity of the \ in solving problem VIII.4 is O(n3), it is T favorable, for further improvement, to update dictionary D by xi = xˆi/ 1 xˆi (VIII.9) fixing the other variables. The strategy of the KSVD algorithm 1 T rewrites the problem VIII.4 into where the operator a b denotes a− b, and Ci = (D 1yT )(DT 1yT )T is\ the covariance matrix with respect− N i − i ˆ 2 T 2 to yi. This is called the LLC algorithm. Furthermore, the D = arg min Y DX F = arg min Y djxj F D k − k D k − k incremental codebook optimization algorithm has also been j=1 X proposed to obtain a more effective and optimal codebook, T T 2 = arg min (Y dj xj ) dlxl F and the objective function is reformulated as D k − − k j=l X6 N (VIII.5) 2 2 arg min yi Dxi 2 + µ b xi 2 xi,D k − k k ⊙ k (VIII.10) where xj is the j-th row vector of the matrix X. First i=1 T X T 2 the overall representation residual El = Y j=l djxj s.t. 1 x =1, i; d 1, j − 6 i j 2 is computed, and then dl and xl are updated. In order to ∀ k k ≤ ∀ T P Actually, the problem VIII.10 is a process of feature extrac- maintain the sparsity of xl in this step, only the nonzero T tion and the property of ‘locality’ is achieved by constructing elements of xl should be preserved and only the nonzero P T a local coordinate system by exploiting the local bases for items of El should be reserved, i.e. El , from dlxl . Then, P P T each descriptor, and the local bases in the algorithm are SVD decomposes El into El = UΛV , and then updates dictionary dl. The specific KSVD algorithm for dictionary simply obtained by using the K nearest neighbors of yi. learning is summarized to Algorithm 12 and more information The incremental codebook optimization algorithm in problem can be found in the literature [122]. VIII.10 is a joint optimization problem with respect to D (2) Locality constrained linear coding for unsupervised dictio- and xi, and it can be solved by iteratively optimizing one nary learning when fixing the other alternatively. The main steps of the The locality constrained linear coding (LLC) algorithm incremental codebook optimization algorithm are summarized [130] is an efficient local coordinate linear coding method, in Algorithm 13 and more information can be found in the which projects each descriptor into a local constraint system literature [130]. to obtain an effective codebook or dictionary. It has been (3) Other unsupervised dictionary learning methods demonstrated that the property of locality is more essential A large number of different unsupervised dictionary learn- than sparsity, because the locality must lead to sparsity but ing methods have been proposed. The KSVD algorithm and not vice-versa, that is, a necessary condition of sparsity is LLC algorithm are only two typical unsupervised dictio- locality, but not the reverse [130]. nary learning algorithms based on sparse representation. Ad- d N Assume that Y = [y1, y2, , yN ] R × is a matrix ditionally, Jenatton et al. [123] proposed a tree-structured composed of local descriptors··· extracted∈ from examples and dictionary learning problem, which exploited tree-structured d N the objective dictionary D = [d1, d2, , dN ] R × . The sparse regularization to model the relationship between each objective function of LLC is formulated··· as ∈ atom and defined a proximal operator to solve the primal- dual problem. Zhou et al. 
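A compact sketch of the KSVD alternation described above (Algorithm 12 together with the residual/SVD update of Eq. VIII.5) is given here; the greedy coder is a plain OMP written inline rather than a library routine, and the random initialization, atom count, and iteration counts are illustrative assumptions.

import numpy as np

def omp(D, y, k):
    # Greedy l0 sparse coding: pick the most correlated atom, refit by least squares.
    residual, idx = y.copy(), []
    for _ in range(k):
        idx.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ coef
    x = np.zeros(D.shape[1])
    x[idx] = coef
    return x

def ksvd(Y, n_atoms, k, n_iter=10):
    # Alternate sparse coding (OMP) and atom-wise rank-1 dictionary updates (Algorithm 12).
    d, n = Y.shape
    rng = np.random.default_rng(0)
    D = Y[:, rng.choice(n, n_atoms, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_iter):
        X = np.column_stack([omp(D, Y[:, i], k) for i in range(n)])
        for l in range(n_atoms):
            users = np.nonzero(X[l, :])[0]        # samples whose code uses atom l
            if users.size == 0:
                continue
            # Restricted residual E_l^P of Eq. VIII.5, excluding atom l's contribution.
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, l], X[l, users])
            U, S, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, l] = U[:, 0]
            X[l, users] = S[0] * Vt[0, :]
    return D, X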
[131] developed a nonparametric N Bayesian dictionary learning algorithm, which utilized hierar- 2 2 chical Bayesian to model parameters and employed the trun- arg min yi Dxi 2 + µ b xi 2 xi,D k − k k ⊙ k (VIII.6) i=1 cated beta-Bernoulli process to learn the dictionary. Ramirez X s.t. 1T x =1, i =1, 2, ,N and Shapiro [132] employed minimum description length to i ··· JOURNAL 22

Algorithm 13. The incremental codebook optimization algorithm (1) Discriminative KSVD for dictionary learning Task: Learning a dictionary D: arg min N y Dx 2+µ b xi,D i=1 i i 2 Discriminative KSVD (DKSVD) [127] was designed to x 2 s.t. 1T x = 1, i; d 2 1, j P k − k k ⊙ ik2 i ∀ k j k2 ≤ ∀ solve image classification problems. Considering the priori- Input: The matrix composed of given samples Y =[y1, y2, , yN ]. ties of supervised learning theory in classification, DKSVD Initialization: i = 1, ε = 0.01, D initialized by K-Means··· clustering algorithm. incorporates the dictionary learning with discriminative infor- While i is not equal to N do mation and classifier parameters into the objective function Step 1: Initialize b with 1 N zero vector. and employs the KSVD algorithm to obtain the global optimal Step 2: Update locality constraint× parameter b with y d solution for all parameters. The objective function of the b dist( i, j ) for . j = exp  σ  j − ∀ b b DKSVD algorithm is formulated as Step 3: Normalize b using the equation b = − min . bmax b − min Step 4: Exploit the LLC coding algorithm to obtain xi. Step 5: Keep the set of Di, whose corresponding entries of the code x 2 2 i D,C,X = arg min Y DX F + µ H CX F are greater than ε, and drop out other elements, i.e. h i D,C,X k − k k − k i index j abs xi(j) >ε j and D D(:, index). +η C 2 s.t. x k ← { | { } } ∀ i ←2 1T F i 0 Step 6: Update xi exploiting arg max yi D xi 2 s.t. xi = 1. k k k k ≤ Step 7: Update dictionary D using ak classical− gradientk descent method (VIII.11) with respect to problem VIII.6. Step 8: i = i + 1. where Y is the given input samples, D is the learned dictio- End While nary, X is the coefficient term, H is the matrix composed of Output: dictionary D label information corresponding to Y , C is the parameter term for classifier, and η and µ are the weights. With a view to the framework of KSVD, problem VIII.11 can be rewritten as model an effective framework of sparse representation and dictionary learning, and this framework could conveniently Y D 2 incorporate prior information into the process of sparse rep- D,C,X = arg min X F h i D,C,X k √µH − √µC k resentation and dictionary learning. Some other unsupervised     +η C 2 s.t. x k dictionary learning algorithms also have been validated. Mairal F i 0 k k k k(VIII.12)≤ et al. proposed an online dictionary learning [133] algorithm based on stochastic approximations, which treated the dictio- In consideration of the KSVD algorithm, each column of nary learning problem as the optimization of a smooth convex the dictionary will be normalized to l2-norm unit vector and problem over a convex set and employed an iterative online D will also be normalized, and then the penalty algorithm at each step to solve the subproblems. Yang and √µC   Zhang proposed a sparse variation dictionary learning (SVDL) term C 2 will be dropped out and problem VIII.12 will be k kF algorithm [134] for face recognition with a single training reformulated as sample, in which a joint learning framework of adaptive projection and a sparse variation dictionary with sparse bases 2 Z,X = arg min W ZX F s.t. xi 0 k (VIII.13) were simultaneously constructed from the gallery image set to h i Z,X k − k k k ≤ the generic image set. Shi et al. proposed a minimax concave Y D penalty based sparse dictionary learning (MCPSDL) [112] where W = , Z = and apparently the algorithm, which employed a non-convex relaxation online √µH √µC formulation VIII.13 is the same as the framework of KSVD scheme, i.e. 
a minimax concave penalty, instead of using [122] in Eq. VIII.2 and it can be efficiently solved by the regular convex relaxation approaches as approximation of l - 0 KSVD algorithm. norm penalty in sparse representation problem, and designed a coordinate descend algorithm to optimize it. Bao et al. More specifically, the DKSVD algorithm contains two main proposed a dictionary learning by proximal algorithm (DLPM) phases: the training phase and classification phase. For the [118], which provided an efficient alternating proximal algo- training phase, Y is the matrix composed of the training sam- ples and the objective is to learn a discriminative dictionary D rithm for solving the l0-norm minimization based dictionary learning problem and demonstrated its global convergence and the classifier parameter C. DKSVD updates Z column by property. column and for each column vector zi, DKSVD employs the KSVD algorithm to obtain z and its corresponding weight. 2) Supervised dictionary learning: Unsupervised dictionary i Then, the DKSVD algorithm normalizes the dictionary D and learning just considers that the examples can be sparsely classifier parameter C by represented by the learned dictionary and leaves out the label information of the examples. Thus, unsupervised dictionary d1 d2 dM learning can perform very well in data reconstruction, such D′ = [d′ , d′ , , d′ ] = [ , , , ] 1 2 ··· M d1 d2 ··· dM as image denoising and image compressing, but is not ben- C = [c , c , , c ] = [ kc1 k, kc2 k, , kcM k] ′ 1′ 2′ M′ d1 d2 dM ··· k k k k ··· k k eficial to perform classification. On the contrary, supervised xi′ = xi di dictionary learning embeds the class label into the process of ×k k (VIII.14) sparse representation and dictionary learning so that this leads For the classification phase, Y is the matrix composed of to the learned dictionary with discriminative information for the test samples. Based on the obtained learning results D′ effective classification. and C′, the sparse coefficient matrix xˆi can be obtained for JOURNAL 23

each test sample yi by exploiting the OMP algorithm, which is to solve 2 Z,X = arg min T ZX 2 s.t. xi 0 k (VIII.18) h i Z,X k − k k k ≤ 2 Y D xˆi = arg min yi D′xi′ 2 s.t. xi′ 0 k (VIII.15) k − k k k ≤ where T = √µL , Z = √µA .     On the basis of the corresponding sparse coefficient xˆi, the √ηH √ηC final classification, for each test sample yi, can be performed The learning process of the LC-KSVD algorithm, as is by judging the label result by multiplying xˆi by classifier C′, DKSVD, can be separated into two sections, the training that is, term and the classification term. In the training section, since problem VIII.18 completely satisfies the framework of KSVD, label = C′ xˆi (VIII.16) × the KSVD algorithm is applied to update Z atom by atom and where the label is the final predicted label vector. The class compute X. Thus Z and X can be obtained. Then, the LC- label of yi is the determined class index of label. KSVD algorithm normalizes dictionary D, transform matrix The main highlight of DKSVD is that it employs the A, and the classifier parameter C by framework of KSVD to simultaneously learn a discriminative d1 d2 dM dictionary and classifier parameter, and then utilizes the effi- D′ = [d′ , d′ , , d′ ] = [ , , , ] 1 2 ··· M d1 d2 ··· dM cient OMP algorithm to obtain a sparse representation solution A = [a , a , , a ] = [ka1k , ka2k , , kaMk ] ′ 1′ 2′ M′ d1 d2 dM and finally integrate the sparse solution and learned classifier ··· kc1 k kc2 k ··· kcM k C′ = [c′ , c′ , , c′ ] = [ , , , ] for ultimate effective classification. 1 2 M d1 d2 dM ··· k k k k ··· k (VIII.19)k (2) Label consistent KSVD for discriminative dictionary learn- In the classification section, Y is the matrix composed of ing the test samples. On the basis of the obtained dictionary D′, Because of the classification term, a competent dictionary the sparse coefficient xˆi can be obtained for each test sample can lead to effectively classification results. The original sparse yi by exploiting the OMP algorithm, which is to solve representation for face recognition [20] regards the raw data 2 as the dictionary, and then reports its promising classification xˆi = arg min yi D′x′ s.t. x′ 0 k (VIII.20) k − ik2 k ik ≤ results. In this section, a label consistent KSVD (LC-KSVD) The final classification is based on a simple linear predictive [135, 136] is introduced to learn an effective discriminative function dictionary for image classification. As an extension of D- KSVD, LC-KSVD exploits the supervised information to learn l = argmax f = C′ xˆi (VIII.21) f { × } the dictionary and integrates the process of constructing the dictionary and optimal linear classifier into a mixed recon- where f is the final predicting label vector and the test sample structive and discriminative objective function, and then jointly yi is classified as a member of the l-th class. obtains the learned dictionary and an effective classifier. The The main contribution of LC-KSVD is to jointly incorporate objective function of LC-KSVD is formulated as the discriminative sparse coding term and classifier parameter term into the objective function for learning a discriminative dictionary and classifier parameter. The LC-KSVD demon- 2 2 D,A,C,X = arg min Y DX F + µ L AX F strates that the obtained solution, compared to other methods, h i D,A,C,X k − k k − k can prevent learning a suboptimal or local optimal solution in +η H CX 2 s.t. x k F i 0 the process of learning a dictionary [135]. 
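The classification stages of DKSVD and LC-KSVD share one pattern: normalize the learned dictionary and rescale the classifier consistently (Eqs. VIII.14/VIII.19), sparse-code the test sample with OMP (Eqs. VIII.15/VIII.20), and take the largest entry of the linear prediction (Eqs. VIII.16/VIII.21). A minimal sketch of that pattern follows; the trained dictionary D and classifier C are assumed given by a training stage not shown here, and the helper names are illustrative.

import numpy as np

def omp(D, y, k):
    # Greedy l0 sparse coder, a stand-in for the OMP step in Eqs. VIII.15 / VIII.20.
    residual, idx = y.copy(), []
    for _ in range(k):
        idx.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ coef
    x = np.zeros(D.shape[1])
    x[idx] = coef
    return x

def classify(D, C, y, k):
    # Normalize atoms and rescale the classifier columns consistently (cf. Eq. VIII.14 / VIII.19),
    # sparse-code the test sample, then report the class with the largest linear score.
    norms = np.linalg.norm(D, axis=0)
    D_n = D / norms
    C_n = C / norms
    x = omp(D_n, y, k)
    scores = C_n @ x                  # label vector f = C' * x_hat
    return int(np.argmax(scores))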
k − k k (VIII.17)k ≤ (3) Fisher discrimination dictionary learning for sparse repre- where the first term denotes the reconstruction error, the sentation second term denotes the discriminative sparse-code error, and Fisher discrimination dictionary learning (FDDL) [137] the final term denotes the classification error. Y is the matrix incorporates the supervised information (class label informa- composed of all the input data, D is the learned dictionary, tion) and the Fisher discrimination message into the objective X is the sparse code term, µ and η are the weights of the function for learning a structured discriminative dictionary, corresponding contribution items, A is a linear transformation which is used for pattern classification. The general model matrix, H is the matrix composed of label information corre- of FDDL is formulated as sponding to Y , C is the parameter term for classifier and L is a joint label matrix for labels of Y and D. For example, J(D,X) = arg min f(Y,D,X)+ µ X 1 + ηg(X) providing that Y = [y1 ...y4] and D = [d1 ...d4] where D,X{ k k } (VIII.22) y1,y2, d1 and d2 are from the first class, and y3,y4, d3 and where Y is the matrix composed of input data, D is the d4 are from the second class, and then the joint label matrix 1 1 0 0 learned dictionary, X is the sparse solution, and µ and η are 1 1 0 0 two constants for tradeoff contributions. The first component L can be defined as L = . Similar to the  0 0 1 1  is the discriminative fidelity term, the second component is  0 0 1 1  the sparse regularization term, and the third component is the   DKSVD algorithm, the objective function VIII.17 can also be discriminative coefficient term, such as Fisher discrimination reformulated as criterion in Eq. (VIII.23). JOURNAL 24

Considering the importance of the supervised information, information and incorporates it into the learning process to i.e. label information, in classification, FDDL respectively enforce the discrimination of the learned dictionary. Recently, updates the dictionary and computes the sparse representation massive supervised dictionary learning algorithms have been solution class by class. Assume that Y i denotes the matrix of proposed. For example, Yang et al. [139] presented a metaface i-th class of input data, vector Xi denotes the sparse repre- dictionary learning method, which is motivated by ‘metagenes’ sentation coefficient of the learned dictionary D over Y i and in gene expression data analysis. Rodriguez and Sapiro [140] i Xj denotes the matrix composed of the sparse representation produced a discriminative non-parametric dictionary learning solutions, which correspond to the j-th class coefficients from (DNDL) framework based on the OMP algorithm for image Xi. Di is denoted as the learned dictionary corresponding to classification. Kong et al. [141] introduced a learned dictionary the i-th class. Thus, the objective function of FDDL is with commonalty and particularity, called DL-COPAR, which integrated an incoherence penalty term into the objective c function for obtaining the class-specific sub-dictionary. Gao i i J(D,X) = arg min( f(Y ,D,X )+ µ X 1+ et al. [142] learned a hybrid dictionary, i.e. category-specific D,X k k i=1 dictionary and shared dictionary, which incorporated a cross- X 2 η(tr(SW (X) SB(X)) + λ X )) dictionary incoherence penalty and self-dictionary incoherence − k kF (VIII.23) penalty into the objective function for learning a discrim- i i i i 2 i i i 2 inative dictionary. Jafari and Plumbley [143] presented a where f(Y ,D,X ) = Y DX F + Y D Xi F + j i 2 k − k k − k greedy adaptive dictionary learning method, which updated the j=i D Xj F and SW (X) and SB(X) are the within-class scatter6 k of X andk between-class scatter of X, respectively. c is learned dictionary with a minimum sparsity index. Some other Pthe number of the classes. To solve problem VIII.23, a natural supervised dictionary learning methods are also competent in idea of optimization is to alternatively optimize D and X image classification, such as supervised dictionary learning in class by class, and then the process of optimization is briefly [144]. Zhou et al. [145] developed a joint dictionary learning introduced. algorithm for object categorization, which jointly learned a When fixing D, problem VIII.23 can be solved by com- commonly shared dictionary and multiply category-specific puting Xi class by class, and its sub-problem is formulated dictionaries for correlated object classes and incorporated the as Fisher discriminant fidelity term into the process of dictionary learning. Ramirez et al. proposed a method of dictionary i i i i i J(X ) = arg min f(Y ,D,X )+ µ X 1 + ηg(X ) learning with structured incoherence (DLSI) [140], which Xi k k unified the dictionary learning and sparse decomposition into a (VIII.24) where g(Xi)= Xi M 2 c M M 2 + λ Xi 2 sparse dictionary learning framework for image classification k − ikF − t=1 k t − kF k kF and Mj and M denote the mean matrices corresponding to and data clustering. Ma et al. 
presented a discriminative low- the j-th class of Xi and Xi, respectively.P Problem VIII.23 can rank dictionary learning for sparse representation (DLRD SR) be solved by the iterative projection method in the literature [146], in which the sparsity and the low-rank properties were [138]. integrated into one dictionary learning scheme where sub- When fixing α, problem VIII.23 can be rewritten as dictionary with discriminative power was required to be low- rank. Lu et al. developed a simultaneous feature and dictionary i i i j j 2 learning [147] method for face recognition, which jointly J(D) = arg min( Y D X D X F + D k − − k learned the feature projection matrix for subspace learning and j=i X6 (VIII.25) the discriminative structured dictionary. Yang et al. introduced Y i DiXi 2 + DiXi 2 ) k − i kF k j kF a latent dictionary learning (LDL) [148] method for sparse rep- j=i X6 resentation based image classification, which simultaneously where Xi here denotes the sparse representation of Y over learned a discriminative dictionary and a latent representation Di. In this section, each column of the learned dictionary is model based on the correlations between label information normalized to a unit vector with l2-norm. The optimization of and dictionary atoms. Jiang et al. presented a submodular problem VIII.25 computes the dictionary class by class and dictionary learning (SDL) [149] method, which integrated the it can be solved by exploiting the algorithm in the literature entropy rate of a random walk on a graph and a discriminative [139]. term into a unified objective function and devised a greedy- The main contribution of the FDDL algorithm lies in based approach to optimize it. Si et al. developed a support combining the Fisher discrimination criterion into the process vector guided dictionary learning (SVGDL) [150] method, of dictionary learning. The discriminative power comes from which constructed a discriminative term by using adaptively the method of constructing the discriminative dictionary using weighted summation of the squared distances for all pairwise the function f in problem VIII.22 and simultaneously formu- of the sparse representation solutions. lates the discriminative sparse representation coefficients by exploiting the function g in problem VIII.22. (4) Other supervised dictionary learning for sparse represen- B. Sparse representation in image processing tation Recently, sparse representation methods have been extensively Unlike unsupervised dictionary learning, supervised dictio- applied to numerous real-world applications [151, 152]. The nary learning emphasizes the significance of the class label techniques of sparse representation have been gradually ex- JOURNAL 25 tended and introduced to image processing, such as super- the desired representation solution α is sufficiently sparse, resolution image processing, image denoising and image problem VIII.27 can be converted into the following problem: restoration. arg min α s.t. x D α 2 ε (VIII.28) First, the general framework of image processing us- k k1 k − h k2 ≤ ing sparse representation especially for image reconstruction or should be introduced: 2 arg min x Dhα 2 + λ α 1 (VIII.29) Step 1: Partition the degraded image into overlapped patches k − k k k or blocks. where ε is a small constant and λ is the Lagrange multiplier. Step 2: Construct a dictionary, denoted as D, and assume The solution of problem VIII.28 can be achieved by two main that the following sparse representation formulation should be phases, i.e. 
The following part of this subsection explicitly introduces some image processing techniques using sparse representation.

The main task of super-resolution image processing is to extract the high-resolution image from its low-resolution counterpart, and this challenging problem has attracted much attention. The most representative work exploits the sparse representation theory to generate a super-resolution (SRSR) image from a single low-resolution image [153].

SRSR is mainly performed on two compact learned dictionaries D_l and D_h, which denote the dictionaries of low-resolution image patches and of the corresponding high-resolution image patches, respectively. D_l is directly employed to recover high-resolution images from dictionary D_h. Let X and Y denote the high-resolution image and its corresponding low-resolution image, respectively, and let x and y be a high-resolution image patch and its corresponding low-resolution image patch, respectively. Thus, x = Py, where P is the projection matrix. Moreover, if the low-resolution image Y is produced by downsampling and blurring the high-resolution image X, the following reconstruction constraint should be satisfied

Y = SBX    (VIII.26)

where S and B are a downsampling operator and a blurring filter, respectively. However, the solution of problem VIII.26 is ill-posed because infinitely many solutions can be obtained for a given low-resolution input image Y. To this end, SRSR [153] introduces a prior knowledge assumption, which is formulated as

x = D_h \alpha    s.t.    \|\alpha\|_0 \leq k    (VIII.27)

where k is a small constant. This assumption imposes the prior that any image patch x can be approximately represented by a linear combination of a few training samples from dictionary D_h. As presented in Subsection III-B, problem VIII.27 is an NP-hard problem, and sparse representation with l1-norm regularization is introduced instead. If the desired representation solution \alpha is sufficiently sparse, problem VIII.27 can be converted into the following problem:

\arg\min \|\alpha\|_1    s.t.    \|x - D_h \alpha\|_2^2 \leq \varepsilon    (VIII.28)

or

\arg\min \|x - D_h \alpha\|_2^2 + \lambda \|\alpha\|_1    (VIII.29)

where \varepsilon is a small constant and \lambda is the Lagrange multiplier. The solution of problem VIII.28 is obtained in two main phases, i.e. local model based sparse representation (LMBSR) and an enhanced global reconstruction constraint. The first phase of SRSR, i.e. LMBSR, operates on each image patch: for each low-resolution image patch y, the following problem is solved

\arg\min \|Fy - FD_l \alpha\|_2^2 + \lambda \|\alpha\|_1    (VIII.30)

where F is a feature extraction operator. A one-pass algorithm similar to that of [154] is introduced to enhance the compatibility between adjacent patches. Furthermore, a modified optimization problem is proposed to guarantee that the super-resolution reconstruction coincides with the previously obtained adjacent high-resolution patches, and the problem is reformulated as

\arg\min \|\alpha\|_1    s.t.    \|Fy - FD_l \alpha\|_2^2 \leq \varepsilon_1,    \|v - LD_h \alpha\|_2^2 \leq \varepsilon_2    (VIII.31)

where v is the previously obtained high-resolution image on the overlap region, and L refers to the region of overlap between the current patch and the previously obtained high-resolution image. Thus problem VIII.31 can be rewritten as

\arg\min \|\hat{y} - D\alpha\|_2^2 + \lambda \|\alpha\|_1    (VIII.32)

where \hat{y} = [Fy; v] and D = [FD_l; LD_h] are obtained by stacking the two data terms. Problem VIII.32 can simply be solved by the previously introduced sparse representation solutions with l1-norm minimization. Assuming that the optimal solution of problem VIII.32, i.e. \alpha^*, is achieved, the high-resolution patch can be easily reconstructed by x = D_h \alpha^*.
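For one patch, the first phase therefore reduces to building the stacked system of problem VIII.32 and running any l1 solver. The sketch below, under the assumption that F and L are given as matrices, uses plain ISTA iterations; it is an illustration rather than the one-pass implementation of [153].

```python
import numpy as np

def sr_patch_code(F, y, v, L, D_l, D_h, lam, n_iter=300):
    """Build y_hat = [F y; v] and D_tilde = [F D_l; L D_h] as in problem
    VIII.32, solve the l1-regularized least squares by simple ISTA steps,
    and return the reconstructed high-resolution patch D_h @ alpha."""
    y_hat = np.concatenate([F @ y, v])
    D_t = np.vstack([F @ D_l, L @ D_h])
    step = 1.0 / (np.linalg.norm(D_t, 2) ** 2)
    alpha = np.zeros(D_t.shape[1])
    for _ in range(n_iter):
        alpha -= step * (D_t.T @ (D_t @ alpha - y_hat))      # gradient step
        alpha = np.sign(alpha) * np.maximum(np.abs(alpha) - step * lam, 0.0)
    return D_h @ alpha        # high-resolution patch x = D_h alpha*
```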
The second phase of SRSR enforces the global reconstruction constraint to eliminate possible inconsistency or noise introduced by the first phase and to make the obtained image more consistent and compatible. Suppose that the high-resolution image obtained by the first phase is denoted as the matrix X_0; X_0 is projected onto the solution space of the reconstruction constraint VIII.26, and the problem is formulated as follows

X^* = \arg\min_X \|X - X_0\|_2^2    s.t.    Y = SBX    (VIII.33)

Problem VIII.33 can be solved by the back-projection method in [155], and the obtained image X^* is regarded as the final optimal high-resolution image. The entire super-resolution via sparse representation procedure is summarized in Algorithm 14, and more information can be found in the literature [153].
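The following fragment sketches the flavor of this global correction step: starting from the first-phase result X_0, the residual of the constraint Y = SBX is repeatedly pushed back into the estimate. It is a simplified Landweber-style iteration written with SB as an explicit matrix, not the exact back-projection procedure of [155].

```python
import numpy as np

def back_project(X0, Y, SB, n_iter=50):
    """Approximate treatment of problem VIII.33: start from X0 and
    iteratively reduce the reconstruction error Y - SB @ X, moving X
    toward the constraint set {X : Y = SB X} while staying near X0."""
    step = 1.0 / (np.linalg.norm(SB, 2) ** 2)   # safe step size
    X = X0.copy()
    for _ in range(n_iter):
        residual = Y - SB @ X            # violation of the global constraint
        X = X + step * SB.T @ residual   # back-project the residual
    return X
```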

Algorithm 14. Super-resolution via sparse representation
Input: Training image patch dictionaries D_l and D_h, a low-resolution image Y.
For each overlapped 3x3 patch y of Y, using the one-pass algorithm from left to right and top to bottom:
  Step 1: Compute the optimal sparse representation coefficients \alpha^* in problem (VIII.32).
  Step 2: Compute the high-resolution patch by x = D_h \alpha^*.
  Step 3: Put the patch x into the high-resolution image X_0 at the corresponding location.
End
Step 4: Compute the final super-resolution image X^* in problem (VIII.33).
Output: X^*

Furthermore, many other methods based on sparse representation have been proposed to solve the super-resolution image processing problem. For example, Yang et al. presented a modified version called joint dictionary learning via sparse representation (JDLSR) [156], which jointly learned two dictionaries that enforced the similarity of the sparse representations of low-resolution and high-resolution images. Tang et al. [157] first explicitly analyzed the rationales of the sparse representation theory in performing the super-resolution task, and proposed to exploit the L2-Boosting strategy to learn coupled dictionaries, which were employed to construct a sparse coding space. Zhang et al. [158] presented an image super-resolution reconstruction scheme by employing dual-dictionary learning and sparse representation, and Gao et al. [159] proposed a sparse neighbor embedding method, which incorporated sparse neighbor search and a HoG clustering method into the process of image super-resolution reconstruction. Fernandez-Granda and Candes [160] designed a transform-invariant group sparse regularizer by implementing a data-driven non-parametric regularizer with a learned domain transform on group sparse representation for image super-resolution. Lu et al. [161] proposed a geometry constrained sparse representation method for single image super-resolution by jointly obtaining an optimal sparse solution and learning a discriminative and reconstructive dictionary. Dong et al. [162] proposed to harness an adaptive sparse optimization with nonlocal regularization based on adaptive principal component analysis, enhanced by nonlocal similar patch grouping and a nonlocal self-similarity quadratic constraint, to solve the image super-resolution problem. Dong et al. [163] proposed to integrate an adaptive sparse domain selection and an adaptive regularization based on piecewise autoregressive models into the sparse representation framework for single image super-resolution reconstruction. Mallat and Yu [164] proposed a sparse mixing estimator for image super-resolution, which introduced an adaptive estimator model by combining a group of linear inverse estimators based on different prior knowledge for sparse representation.

Noise in an image is unavoidable in the process of image acquisition, and the need for sparse representation may arise when noise exists in image data. In such a case, the noisy image may suffer from missing information or distortion, which decreases the precision and accuracy of image processing. Eliminating such noise is greatly beneficial to many applications. The main goal of image denoising is to distinguish the actual signal from the noise signal so that the noise can be removed and the genuine image reconstructed. Exploiting image sparsity and redundancy [4, 7], sparse representation for image denoising first extracts the sparse image components, which are regarded as useful information, then abandons the representation residual, which is treated as the image noise term, and finally reconstructs the image from the pre-obtained sparse components, i.e. the noise-free image. Extensive research articles on image denoising based on sparse representation have been published. For example, Donoho [8, 29, 165] first discovered the connection between compressed sensing and image denoising. Subsequently, the most representative work on using sparse representation for image denoising was proposed in literature [166], in which a global sparse representation model over learned dictionaries (SRMLD) was used for image denoising. The following prior assumption should be satisfied: every image block of image x, denoted as z, can be sparsely represented over a dictionary D, i.e. the solution of the following problem is sufficiently sparse:

\arg\min_\alpha \|\alpha\|_0    s.t.    D\alpha = z    (VIII.34)

An equivalent problem can be reformulated for a proper value of \lambda, i.e.

\arg\min_\alpha \|D\alpha - z\|_2^2 + \lambda \|\alpha\|_0    (VIII.35)

Taking the above prior knowledge into full consideration, the objective function of SRMLD based on a Bayesian treatment is formulated as

\arg\min_{D,\alpha_i,x} \delta \|x - y\|_2^2 + \sum_{i=1}^{M} \|D\alpha_i - P_i x\|_2^2 + \sum_{i=1}^{M} \lambda_i \|\alpha_i\|_0    (VIII.36)

where x is the finally denoised image, y is the measured image corrupted by additive Gaussian white noise, P_i is a projection operator that extracts the i-th block from image x, M is the number of overlapping blocks, D is the learned dictionary, \alpha_i is the coefficient vector, \delta is the weight of the first term and \lambda_i is the Lagrange multiplier.
The first term in VIII.36 is the log-likelihood global constraint, which enforces that the obtained noise-free image x is sufficiently similar to the original measured image y. The second and third terms are the prior knowledge of the Bayesian treatment, which is presented in problem VIII.35. The optimization of problem VIII.36 is a joint optimization problem with respect to D, \alpha_i and x. It can be solved by alternately optimizing one variable while fixing the others. The process of optimization is briefly introduced below.

When the dictionary D and the sparse representation solutions \alpha_i are fixed, problem VIII.36 can be rewritten as

\arg\min_x \delta \|x - y\|_2^2 + \sum_{i=1}^{M} \|D\alpha_i - z\|_2^2    (VIII.37)

where z = P_i x. Apparently, problem VIII.37 is a simple convex optimization problem and has a closed-form solution, which is given by

x = \Big( \sum_{i=1}^{M} P_i^T P_i + \delta I \Big)^{-1} \Big( \sum_{i=1}^{M} P_i^T D\alpha_i + \delta y \Big)    (VIII.38)
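Because \sum_i P_i^T P_i is a diagonal matrix that simply counts how many patches cover each pixel, Eq. (VIII.38) reduces to a pixel-wise weighted average. The sketch below shows this update, assuming the denoised patches D\alpha_i and their top-left positions are already available; the variable names are illustrative.

```python
import numpy as np

def denoised_image(y, patches_hat, positions, patch, delta):
    """Closed-form update of Eq. (VIII.38) written per pixel: accumulate
    the denoised patches D @ alpha_i (given in 'patches_hat') plus delta*y
    in the numerator, and the coverage counts plus delta in the denominator."""
    num = delta * y.astype(float)
    den = delta * np.ones_like(y, dtype=float)
    for x_hat, (r, c) in zip(patches_hat, positions):
        num[r:r+patch, c:c+patch] += x_hat.reshape(patch, patch)  # P_i^T D alpha_i
        den[r:r+patch, c:c+patch] += 1.0                          # P_i^T P_i
    return num / den
```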

When x is given, problem VIII.36 can be written as

\arg\min_{D,\alpha_i} \sum_{i=1}^{M} \|D\alpha_i - P_i x\|_2^2 + \sum_{i=1}^{M} \lambda_i \|\alpha_i\|_0    (VIII.39)

where the problem can be divided into M sub-problems, and the i-th sub-problem can be reformulated as the following dictionary learning problem:

\arg\min_{D,\alpha_i} \|D\alpha_i - z\|_2^2    s.t.    \|\alpha_i\|_0 \leq \tau    (VIII.40)

where z = P_i x and \tau is a small constant. One can see that this sub-problem takes the same form as problem VIII.2, and it can be solved by the KSVD algorithm previously presented in Subsection VIII-A2. The algorithm for image denoising exploiting sparse and redundant representations over a learned dictionary is summarized in Algorithm 15, and more information can be found in literature [166].

Algorithm 15. Image denoising via sparse and redundant representation over a learned dictionary
Task: To denoise a measured image y corrupted by additive Gaussian white noise:
  \arg\min_{D,\alpha_i,x} \delta \|x - y\|_2^2 + \sum_{i=1}^{M} \|D\alpha_i - P_i x\|_2^2 + \sum_{i=1}^{M} \lambda_i \|\alpha_i\|_0
Input: Measured image sample y, the number of training iterations T.
Initialization: t = 1, set x = y, D initialized by an overcomplete DCT dictionary.
While t \leq T do
  Step 1: For each image patch P_i x, employ the KSVD algorithm to update the sparse representation solution \alpha_i and the corresponding dictionary D.
  Step 2: t = t + 1
End While
Step 3: Compute the value of x by using Eq. (VIII.38).
Output: denoised image x
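The dictionary update inside Step 1 of Algorithm 15 is the atom-by-atom KSVD sweep; a compact sketch of that sweep is given below (the sparse coding of each patch, typically done with OMP, is assumed to have produced the coefficient matrix X already).

```python
import numpy as np

def ksvd_dictionary_update(Y, D, X):
    """One KSVD dictionary-update sweep: for each atom d_k, restrict to the
    patches (columns of Y) whose code uses it, form the residual without
    d_k, and replace (d_k, row k of X) by the rank-1 SVD approximation."""
    for k in range(D.shape[1]):
        users = np.nonzero(X[k, :])[0]          # patches that use atom k
        if users.size == 0:
            continue
        E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
        U, S, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, k] = U[:, 0]                       # updated atom (unit norm)
        X[k, users] = S[0] * Vt[0, :]           # updated coefficients
    return D, X
```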
Moreover, extensive modified sparse representation based image denoising algorithms have been proposed. For example, Dabov et al. [167] proposed an enhanced sparse representation with a block-matching 3-D (BM3D) transform-domain filter based on self-similarities, which clusters similar 2-D image patches into 3-D data arrays and applies an iterative collaborative filtering procedure for image denoising. Mairal et al. [168] proposed extending the KSVD-based grayscale algorithm with a generalized weighted average algorithm for color image denoising. Protter and Elad [169] extended the techniques of sparse and redundant representations to image sequence denoising by exploiting spatio-temporal atoms, dictionary propagation over time and dictionary learning. Dong et al. [170] designed a clustering based sparse representation algorithm, which was formulated as a double-header sparse optimization problem built upon dictionary learning and structural clustering. Recently, Jiang et al. [171] proposed a variational encoding framework with a weighted sparse nonlocal constraint, which was constructed by integrating an image sparsity prior and a nonlocal self-similarity prior into a unified regularization term to overcome the mixed noise removal problem. Gu et al. [172] studied a weighted nuclear norm minimization (WNNM) method with F-norm fidelity under different weighting rules, optimized by non-local self-similarity, for image denoising. Ji et al. [173] proposed a patch-based video denoising algorithm that stacks similar patches in both the spatial and temporal domains to formulate a low-rank matrix problem with the nuclear norm. Cheng et al. [174] proposed an impressive image denoising method based on an extension of the KSVD algorithm via group sparse representation.

The primary purpose of image restoration is to recover the original image from a degraded or blurred image. The sparse representation theory has been extensively applied to image restoration. For example, Bioucas-Dias and Figueiredo [175] introduced a two-step iterative shrinkage/thresholding (TwIST) algorithm for image restoration, which is more efficient and can be viewed as an extension of the IST method. Mairal et al. [176] presented a multiscale sparse image representation framework based on the KSVD dictionary learning algorithm and shift-invariant sparsity prior knowledge for the restoration of color images and video image sequences. Recently, Mairal et al. [177] proposed a learned simultaneous sparse coding (LSSC) model, which integrated sparse dictionary learning and the nonlocal self-similarities of natural images into a unified framework for image restoration. Zoran and Weiss [178] proposed an expected patch log likelihood (EPLL) optimization model, which restored the image from patches to the whole image based on learned prior knowledge of any patch acquired by Maximum A-Posteriori estimation instead of simple patch averaging. Bao et al. [179] proposed a fast orthogonal dictionary learning algorithm, in which a sparse image representation based orthogonal dictionary was learned for image restoration. Zhang et al. [180] proposed a group-based sparse representation, which combined characteristics of local sparsity and nonlocal self-similarity of natural images in the domain of the group. Dong et al. [181, 182] proposed a centralized sparse representation (CSR) model, which combined the local and nonlocal sparsity and redundancy properties for variational problem optimization by introducing the concept of a sparse coding noise term.

Here we mainly introduce the recently proposed, simple but effective image restoration algorithm, the CSR model [181]. For a degraded image y, the problem of image restoration can be formulated as

y = Hx + v    (VIII.41)

where H is a degradation operator, x is the original high-quality image and v is additive Gaussian white noise. Suppose that the following two sparse optimization problems are satisfied

\alpha_x = \arg\min \|\alpha\|_1    s.t.    \|x - D\alpha\|_2^2 \leq \varepsilon    (VIII.42)

\alpha_y = \arg\min \|\alpha\|_1    s.t.    \|y - HD\alpha\|_2^2 \leq \varepsilon    (VIII.43)

where y and x respectively denote the degraded image and the original high-quality image, and \varepsilon is a small constant. A new concept called sparse coding noise (SCN) is defined as

v_\alpha = \alpha_y - \alpha_x    (VIII.44)

Given a dictionary D, minimizing the SCN makes the image better reconstructed and improves the quality of the image restoration, because the difference between the actual and the ideal reconstruction satisfies \tilde{x} - \hat{x} = D\alpha_y - D\alpha_x = Dv_\alpha, where \hat{x} = D\alpha_x and \tilde{x} = D\alpha_y.

Thus, the objective function is reformulated as

\alpha_y = \arg\min_\alpha \|y - HD\alpha\|_2^2 + \lambda \|\alpha\|_1 + \mu \|\alpha - \alpha_x\|_1    (VIII.45)

where \lambda and \mu are both constants. However, the value of \alpha_x is difficult to evaluate directly. Because many nonlocal similar patches are associated with a given image patch i, clustering these patches via block matching is advisable, and the sparse code \alpha_{il} of each similar patch l to patch i in cluster \Omega_i can be computed. Moreover, the unbiased estimation of \alpha_x, denoted by E[\alpha_x], can empirically be approximated by \alpha_x under some prior knowledge [181]; the SCN algorithm then employs the nonlocal means estimation method [183] to evaluate the unbiased estimation of \alpha_x, that is, using the weighted average of all \alpha_{il} to approach E[\alpha_x], i.e.

\theta_i = \sum_{l \in \Omega_i} w_{il} \alpha_{il}    (VIII.46)

where w_{il} = \exp(-\|x_i - x_{il}\|_2^2 / h)/N, x_i = D\alpha_i, x_{il} = D\alpha_{il}, N is a normalization parameter and h is a constant. Thus, the objective function VIII.45 can be rewritten as

\alpha_y = \arg\min_\alpha \|y - HD\alpha\|_2^2 + \lambda \|\alpha\|_1 + \mu \sum_{i=1}^{M} \|\alpha_i - \theta_i\|_1    (VIII.47)

where M is the number of the separated patches. In the j-th iteration, the solution of problem VIII.47 is iteratively computed as

\alpha_y^{j+1} = \arg\min_\alpha \|y - HD\alpha\|_2^2 + \lambda \|\alpha\|_1 + \mu \sum_{i=1}^{M} \|\alpha_i - \theta_i^j\|_1    (VIII.48)

It is obvious that problem VIII.47 can be optimized by the augmented Lagrange multiplier method [184] or the iterative shrinkage algorithm in [185]. According to the maximum average posterior principle and the distribution of the sparse coefficients, the regularization parameter \lambda and the constant \mu can be adaptively determined by \lambda = 2\sqrt{2}\rho^2/\sigma_i and \mu = 2\sqrt{2}\rho^2/\eta_i, where \rho, \sigma_i and \eta_i are the standard deviations of the additive Gaussian noise, \alpha and the SCN signal, respectively. Moreover, in the process of clustering the image patches, a local PCA dictionary is learned for each cluster and employed to code each patch within its corresponding cluster. The main procedures of the CSR algorithm are summarized in Algorithm 16, and readers may refer to literature [181] for more details.

Algorithm 16. Centralized sparse representation for image restoration
Initialization: Set x = y, initialize the regularization parameters \lambda and \mu, the number of training iterations T, t = 0, \theta^0 = 0.
Step 1: Partition the degraded image into M overlapped patches.
While t \leq T do
  Step 2: For each image patch, update the corresponding dictionary for each cluster via k-means and PCA.
  Step 3: Update the regularization parameters \lambda and \mu by using \lambda = 2\sqrt{2}\rho^2/\sigma_t and \mu = 2\sqrt{2}\rho^2/\eta_t.
  Step 4: Compute the nonlocal means estimation of the unbiased estimation of \alpha_x, i.e. \theta_i^{t+1}, by using Eq. (VIII.46) for each image patch.
  Step 5: For the given \theta_i^{t+1}, compute the sparse representation solution \alpha_y^{t+1} in problem (VIII.48) by using the extended iterative shrinkage algorithm in literature [185].
  Step 6: t = t + 1
End While
Output: Restored image x = D\alpha_y^{t+1}
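The nonlocal estimate of Step 4 is the only part of Algorithm 16 that is not a standard l1 update, so a small sketch of Eq. (VIII.46) is given below. It assumes the codes of all patches in a cluster are stacked as rows of an array, with the first row taken as the current patch's code; these conventions and names are illustrative only.

```python
import numpy as np

def nonlocal_code_estimate(alpha_cluster, D, h):
    """Eq. (VIII.46) for one patch: weight the sparse codes alpha_il of the
    patches in cluster Omega_i by the similarity of their reconstructions
    to the current patch and return the weighted average theta_i."""
    x_i = D @ alpha_cluster[0]                       # current patch x_i = D alpha_i
    weights = np.array([np.exp(-np.sum((x_i - D @ a) ** 2) / h)
                        for a in alpha_cluster])
    weights /= weights.sum()                         # N is the normalization factor
    return weights @ alpha_cluster                   # theta_i = sum_l w_il * alpha_il
```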
C. Sparse representation in image classification and visual tracking

In addition to these effective applications in image processing, sparse representation has also been proposed and extensively studied for image classification and visual tracking. Since Wright et al. [20] proposed to employ sparse representation to perform robust face recognition, more and more researchers have been applying the sparse representation theory to the fields of computer vision and pattern recognition, especially image classification and object tracking. Experimental results have suggested that the sparse representation based classification method can somewhat overcome the challenging issues of illumination changes, random pixel corruption, large block occlusion and disguise.

As face recognition is a representative component of pattern recognition and computer vision applications, the applications of sparse representation in face recognition can sufficiently reveal the potential nature of sparse representation. The most representative sparse representation method for face recognition has been presented in literature [20], and the general scheme of the sparse representation based classification (SRC) method is summarized in Algorithm 17. Suppose that there are n training samples, X = [x_1, x_2, ..., x_n], from c classes. Let X_i denote the samples from the i-th class and let the test sample be y.

Algorithm 17. The scheme of the sparse representation based classification method
Step 1: Normalize all the samples to have unit l2-norm.
Step 2: Exploit the linear combination of all the training samples to represent the test sample by solving the following l1-norm minimization problem:
  \alpha^* = \arg\min \|\alpha\|_1    s.t.    \|y - X\alpha\|_2^2 \leq \varepsilon
Step 3: Compute the representation residual for each class:
  r_i = \|y - X_i \alpha_i^*\|_2^2
where \alpha_i^* denotes the representation coefficients associated with the i-th class.
Step 4: Output the identity of the test sample y by judging label(y) = \arg\min_i (r_i).
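Algorithm 17 can be written in a few lines of Python, as sketched below. The l1 problem of Step 2 is handled here with plain ISTA for self-containment; any of the l1 solvers surveyed earlier could be substituted, and the parameter values are only illustrative.

```python
import numpy as np

def src_classify(X, labels, y, lam=1e-2, n_iter=500):
    """SRC (Algorithm 17): columns of X are training samples, 'labels' gives
    their classes, y is the test sample. Steps 3-4 pick the class with the
    smallest class-wise reconstruction residual."""
    X = X / np.linalg.norm(X, axis=0, keepdims=True)   # Step 1: unit l2-norm
    y = y / np.linalg.norm(y)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)
    alpha = np.zeros(X.shape[1])
    for _ in range(n_iter):                            # Step 2: l1 coding (ISTA)
        alpha -= step * (X.T @ (X @ alpha - y))
        alpha = np.sign(alpha) * np.maximum(np.abs(alpha) - step * lam, 0.0)
    residuals = {}
    for c in set(labels):                              # Step 3: class residuals
        mask = np.array([l == c for l in labels])
        residuals[c] = np.sum((y - X[:, mask] @ alpha[mask]) ** 2)
    return min(residuals, key=residuals.get)           # Step 4: predicted label
```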
[190] introduced an all the targets as training samples, and then employed the extension of the spatial pyramid matching (SPM) algorithm sparse representation classification method to do effective called ScSPM, which incorporated SIFT sparse representation object tracking. Zhang et al. [205] employed the concept into the spatial pyramid matching algorithm. Subsequently, of sparse representation based on a particle filter framework Gao et al. [191] developed a kernel sparse representation to construct a multi-task sparse learning method denoted as with the SPM algorithm called KSRSPM, and then proposed multi-task tracking for robust visual tracking. Additionally, another version of an improvement of the SPM called LSc- because of the discriminative sparse representation between SPM [192], which integrated the Laplacian matrix with local the target and the background, Jia et al. [206] conceived a features into the objective function of the sparse representation structural local sparse appearance model for robust object method. Kulkarni and Li [193] proposed a discriminative tracking by integrating the partial and spatial information from affine sparse codes method (DASC) on a learned affine- the target based on an alignment-pooling algorithm. Liu et al. invariant feature dictionary from input images and exploited [207] proposed constructing a two-stage sparse optimization the AdaBoost-based classifier to perform image classification. based online visual tracking method, which jointly minimized Zhang et al. [194] proposed integrating the non-negative the objective reconstruction error and maximized the discrim- sparse coding, low-rank and decomposition inative capability by choosing distinguishable features. Liu et (LR-Sc+SPM) method, which exploited non-negative sparse al. [208] introduced a local sparse appearance model (SPT) coding and SPM for achieving local features representation with a static sparse dictionary learned from k-selection and and employed low-rank and sparse for dynamic updated basis distribution to eliminate potential drift- sparse representation, for image classification. Recently, Zhang ing problems in the process of visual tracking. Bao et al. [209] et al. [195] presented a low-rank sparse representation (LRSR) developed a fast real time l1-tracker called the APG-l1tracker, learning method, which preserved the sparsity and spatial which exploited the accelerated proximal gradient algorithm to consistency in each procedure of feature representation and improve the l1-tracker solver in [201]. Zhong et al. [210] ad- jointly exploited local features from the same spatial proximal dressed the object tracking problem by developing a sparsity- regions for image classification. Zhang et al. [196] developed based collaborative model, which combined a sparsity-based a structured low-rank sparse representation (SLRSR) method classifier learned from holistic templates and a sparsity-based for image classification, which constructed a discriminative template model generated from local representations. Zhang et dictionary in training terms and exploited low-rank matrix al. [211] proposed to formulate a sparse feature measurement reconstruction for obtaining discriminative representations. matrix based on an appearance model by exploiting non- Tao et al. 
Mei et al. employed the idea of sparse representation in visual tracking [201] and vehicle classification [202], introducing nonnegative sparse constraints and a dynamic template updating strategy. In the context of the particle filter framework, the sparse technique was exploited to guarantee that each target candidate could be sparsely represented using linear combinations of the fewest target and particle templates, which also demonstrated that sparse representation can be propagated to address object tracking problems. Extensive sparse representation methods have since been proposed to address the visual tracking problem. In order to design an accelerated algorithm for the l1 tracker, Li et al. [203] proposed two real-time compressive sensing visual tracking algorithms based on sparse representation, which adopted dimension reduction and the OMP algorithm to improve the efficiency of the recovery procedure in tracking, and also developed a modified version that fuses background templates into the tracking procedure for robust object tracking. Zhang et al. [204] directly treated object tracking as a pattern recognition problem by regarding all the targets as training samples, and then employed the sparse representation classification method to perform effective object tracking. Zhang et al. [205] employed the concept of sparse representation based on a particle filter framework to construct a multi-task sparse learning method, denoted as multi-task tracking, for robust visual tracking. Additionally, because of the discriminative sparse representation between the target and the background, Jia et al. [206] conceived a structural local sparse appearance model for robust object tracking by integrating partial and spatial information from the target based on an alignment-pooling algorithm. Liu et al. [207] proposed constructing a two-stage sparse optimization based online visual tracking method, which jointly minimized the objective reconstruction error and maximized the discriminative capability by choosing distinguishable features. Liu et al. [208] introduced a local sparse appearance model (SPT) with a static sparse dictionary learned from k-selection and a dynamically updated basis distribution to eliminate potential drifting problems in the process of visual tracking. Bao et al. [209] developed a fast real-time l1-tracker, called the APG-l1 tracker, which exploited the accelerated proximal gradient algorithm to improve the l1-tracker solver in [201]. Zhong et al. [210] addressed the object tracking problem by developing a sparsity-based collaborative model, which combined a sparsity-based classifier learned from holistic templates and a sparsity-based template model generated from local representations. Zhang et al. [211] proposed to formulate a sparse feature measurement matrix based on an appearance model by exploiting non-adaptive random projections, and employed a coarse-to-fine strategy to accelerate the computational efficiency of the tracking task. Lu et al. [212] proposed to employ both non-local self-similarity and sparse representation to develop a non-local self-similarity regularized sparse representation method based on geometrical structure information of the target template data set. Wang et al. [213] proposed a sparse representation based online two-stage tracking algorithm, which learned a linear classifier based on local sparse representation on favorable image patches. More detailed visual tracking algorithms can be found in the recent reviews [214, 215].

IX. EXPERIMENTAL EVALUATION

In this section, we take the object categorization problem as an example to evaluate the performance of different sparse representation based classification methods. We analyze and compare the performance of sparse representation with the most typical algorithms: OMP [36], l1_ls [76], PALM [89], FISTA [82], DALM [89], homotopy [99] and TPTSR [9].
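Of the compared solvers, OMP is the only greedy one; a textbook sketch of it is given below for reference. It is a generic implementation, not necessarily identical to the code used to produce the results reported later.

```python
import numpy as np

def omp(D, x, sparsity):
    """Textbook orthogonal matching pursuit (the greedy baseline OMP [36]):
    repeatedly pick the atom most correlated with the residual, then refit
    the coefficients on the selected support by least squares."""
    support, alpha = [], np.zeros(D.shape[1])
    residual = x.copy()
    for _ in range(sparsity):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    alpha[support] = coef
    return alpha
```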

[Fig. 5 about here: four panels plotting classification accuracy (%) against the value of the regularization parameter λ (0.0001 to 100) for the L1_LS, FISTA, DALM, Homotopy and TPTSR algorithms.]
Fig. 5: Classification accuracies of different sparse representation based classification methods versus varying values of the regularization parameter λ on the (a) ORL, (b) LFW, (c) COIL20 and (d) Fifteen scene datasets.

Plenty of data sets have been collected for object categorization, especially for image classification. Several image data sets are used in our experimental evaluations.

ORL: The ORL database includes 400 face images taken from 40 subjects, each providing 10 face images [216]. For some subjects, the images were taken at different times, with varying lighting, facial expressions and facial details. All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). Each image was resized to a 56x46 image matrix by using a down-sampling algorithm.

LFW face dataset: The Labeled Faces in the Wild (LFW) face database is designed for the study of unconstrained identity verification and face recognition [217]. It contains more than 13,000 images of faces collected from the web under unconstrained conditions. Each face has been labeled with the name of the person pictured, and 1680 of the people pictured have two or more distinct photos in the database. In our experiments, we chose 1251 images from 86 people, with 10-20 images per subject [218]. Each image was manually cropped and resized to 32x32 pixels.

Extended YaleB face dataset: The extended YaleB database contains 2432 frontal face images of 38 individuals, each subject having around 64 near-frontal images under different illuminations [219]. The main challenge of this database is to overcome varying illumination conditions and expressions. The facial portion of each original image was cropped to a 192x168 image. All images in this data set were simply resized to 32x32 pixels for our experiments.

COIL20 dataset: The Columbia Object Image Library (COIL-20) database consists of 1,440 size-normalized gray-scale images of 20 objects [220]. Object images are captured over a full 360-degree rotation at pose intervals of five degrees, so each object has 72 images.

Fifteen scene dataset: This dataset contains 4485 images under 15 natural scene categories presented in literature [221], and each category includes 210 to 410 images. The 15 scene categories are office, kitchen, living room, bedroom, store, industrial, tall building, inside city, street, highway, coast, open country, mountain, forest and suburb. A wide range of outdoor and indoor scenes are included in this dataset. The average image size is around 250x300 pixels, and spatial pyramid matching features are used in our experiments.
A. Parameter selection

Parameter selection, especially selection of the regularization parameter λ in the different minimization problems, plays an important role in sparse representation. In order to make fair comparisons between different sparse representation algorithms, performing optimal parameter selection for the different algorithms on the different datasets is advisable and indispensable. In this subsection, we perform extensive experiments for selecting the best value of the regularization parameter λ over a wide range of options. Specifically, we implement the l1_ls, FISTA, DALM, homotopy and TPTSR algorithms on the different databases to analyze the importance of the regularization parameter. Fig. 5 summarizes the classification accuracies of the different sparse representation based classification methods with varying values of the regularization parameter λ on the two face datasets, i.e. the ORL and LFW face datasets, and the two object datasets, i.e. the COIL20 and Fifteen scene datasets. On the ORL and LFW face datasets, we respectively selected the first five and eight face images of each subject as training samples and used the rest of the image samples for testing. As for the experiments on the COIL20 and Fifteen scene datasets, we respectively treated the first ten images of each subject in both datasets as training samples and used all the remaining images as test samples. From Fig. 5, one can see that the value of the regularization parameter λ can significantly dominate the classification results, and the values of λ that achieve the best classification results on different datasets are distinctly different. An interesting observation is that the performance of the TPTSR algorithm is almost not influenced by the variation of the regularization parameter λ in the experiments on the Fifteen scene dataset, as shown in Fig. 5(d). However, the best classification accuracy is always obtained within the range of 0.0001 to 1. Thus, the value of the regularization parameter is set within the range from 0.0001 to 1.
B. Experimental results

In order to test the performance of different kinds of sparse representation methods, an empirical study of experimental results is conducted in this subsection, and seven typical sparse representation based classification methods are selected for performance evaluation, followed by extensive experimental results. For all datasets, following most previously published work, we randomly choose several samples of every class as training samples and use the rest as test samples, and the experiments are repeated 10 times with the optimal parameter obtained using the cross validation approach. The gray-level features of all images in these data sets are used to perform classification. For the sake of computational efficiency, the principal component analysis algorithm is used as a preprocessing step to preserve 98% of the energy of all the data sets. The classification results and computational time are summarized in Table I.

From the experimental results on the different databases, we can conclude that there is still no single algorithm that achieves the best classification accuracy on all databases. However, some algorithms deserve particular attention. For example, the l1_ls algorithm in most cases achieves better classification results than the other algorithms on the ORL database, and when the number of training samples per class is five, the l1_ls algorithm obtains the highest classification accuracy of 95.90%. The TPTSR algorithm is very computationally efficient in comparison with the other sparse representation algorithms with l1-norm minimization, and the classification accuracies obtained by the TPTSR algorithm are very similar to, and sometimes even better than, those of the other sparse representation based classification algorithms.

The computational time is another indicator for measuring the performance of a specific algorithm. As shown in Table I, the average computational time of each algorithm is reported at the bottom of each block of the table for one specific number of training samples. Note that the computational times of the OMP and TPTSR algorithms are drastically lower than those of the other sparse representation algorithms with l1-norm minimization. This is mainly because the l1-norm minimization algorithms always solve the l1-norm minimization problem iteratively, whereas the OMP and TPTSR algorithms both exploit the fast and efficient least squares technique, which guarantees that their computational time is significantly less than that of the other l1-norm based sparse representation algorithms.
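A sketch of the evaluation protocol described above (random per-class split plus PCA preserving 98% of the energy) is given below; the split sizes, seeds and function names are illustrative and not the exact code used to produce Table I.

```python
import numpy as np

def pca_98(X_train, X_test, energy=0.98):
    """Project both sets onto the leading principal components of the
    training data that preserve the given fraction of the energy."""
    mean = X_train.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    k = int(np.searchsorted(np.cumsum(S**2) / np.sum(S**2), energy)) + 1
    P = Vt[:k].T
    return (X_train - mean) @ P, (X_test - mean) @ P

def random_split(X, y, n_train_per_class, seed=0):
    """Randomly choose n_train_per_class samples of every class for training
    and keep the rest for testing, as in the protocol described above."""
    rng = np.random.default_rng(seed)
    train_mask = np.zeros(len(y), dtype=bool)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        train_mask[rng.permutation(idx)[:n_train_per_class]] = True
    return X[train_mask], y[train_mask], X[~train_mask], y[~train_mask]
```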

Data set (#Tr) | OMP | l1_ls | PALM | FISTA | DALM | Homotopy | TPTSR
ORL(1) | 64.94±2.374 | 68.50±2.021 | 68.36±1.957 | 70.67±2.429 | 70.22±2.805 | 66.53±1.264 | 71.56±3.032
ORL(2) | 80.59±2.256 | 84.84±2.857 | 80.66±2.391 | 84.72±3.242 | 84.38±2.210 | 83.88±2.115 | 83.38±2.019
ORL(3) | 89.00±1.291 | 89.71±1.313 | 86.82±1.959 | 90.00±3.141 | 90.36±1.829 | 89.32±1.832 | 90.71±1.725
ORL(4) | 91.79±1.713 | 94.83±1.024 | 88.63±2.430 | 94.13±1.310 | 94.71±1.289 | 94.38±1.115 | 94.58±1.584
ORL(5) | 93.75±2.125 | 95.90±1.150 | 92.05±1.039 | 95.60±1.761 | 95.50±1.269 | 95.60±1.430 | 95.75±1.439
ORL(6) | 95.69±1.120 | 97.25±1.222 | 92.06±1.319 | 96.69±1.319 | 96.56±1.724 | 97.31±1.143 | 95.81±1.642
Average Time(5) | 0.0038s | 0.1363s | 7.4448s | 0.9046s | 0.8013s | 0.0100s | 0.0017s
LFW(3) | 22.22±1.369 | 28.24±0.667 | 15.16±1.202 | 26.55±0.767 | 25.46±0.705 | 26.12±0.831 | 27.32±1.095
LFW(5) | 27.83±1.011 | 35.58±1.489 | 12.89±1.286 | 34.13±0.459 | 33.90±1.181 | 33.95±1.680 | 35.43±1.409
LFW(7) | 32.76±2.318 | 40.17±2.061 | 11.63±0.937 | 39.86±1.226 | 38.40±1.890 | 38.04±1.251 | 40.92±1.201
LFW(9) | 35.14±1.136 | 44.93±1.123 | 7.84±1.278 | 43.86±1.492 | 43.56±1.393 | 42.29±2.721 | 44.72±1.793
Average Time(7) | 0.0140s | 0.6825s | 33.2695s | 3.0832s | 3.9906s | 0.2372s | 0.0424s
Extended YaleB(3) | 44.20±2.246 | 63.10±2.341 | 63.73±2.073 | 62.84±2.623 | 63.76±2.430 | 64.22±2.525 | 56.23±2.153
Extended YaleB(6) | 72.48±2.330 | 81.97±0.850 | 81.93±0.930 | 82.25±0.734 | 81.74±1.082 | 81.64±1.159 | 78.53±1.731
Extended YaleB(9) | 83.42±0.945 | 88.90±0.544 | 88.50±1.096 | 89.31±0.829 | 89.26±0.781 | 89.12±0.779 | 86.49±1.165
Extended YaleB(12) | 88.23±0.961 | 92.49±0.622 | 91.07±0.725 | 92.03±1.248 | 91.85±0.710 | 92.03±0.767 | 91.30±0.741
Extended YaleB(15) | 91.97±0.963 | 94.22±0.719 | 93.19±0.642 | 94.50±0.824 | 93.07±0.538 | 93.67±0.860 | 93.38±0.785
Average Time(12) | 0.0116s | 3.2652s | 17.4516s | 1.5739s | 1.9384s | 0.5495s | 0.0198s
COIL20(3) | 75.90±1.656 | 77.62±2.347 | 70.26±2.646 | 75.80±2.056 | 76.67±2.606 | 78.46±2.603 | 78.16±2.197
COIL20(5) | 83.00±1.892 | 82.63±1.701 | 79.55±1.153 | 84.09±2.003 | 84.38±1.319 | 84.58±1.487 | 83.69±1.804
COIL20(7) | 87.26±1.289 | 88.22±1.304 | 82.88±1.445 | 88.89±1.598 | 89.00±1.000 | 89.36±1.147 | 87.75±1.451
COIL20(9) | 89.56±1.763 | 90.97±1.595 | 84.94±1.563 | 90.16±1.366 | 91.82±1.555 | 91.44±1.198 | 89.41±2.167
COIL20(11) | 91.70±0.739 | 92.98±1.404 | 87.16±1.184 | 93.43±1.543 | 93.46±1.327 | 93.55±1.205 | 92.71±1.618
COIL20(13) | 92.49±1.146 | 94.29±0.986 | 88.36±1.283 | 94.50±0.850 | 93.92±1.102 | 94.93±0.788 | 92.72±1.481
Average Time(13) | 0.0038s | 0.0797s | 7.5191s | 0.7812s | 0.7762s | 0.0159s | 0.0053s
Fifteen scene(3) | 85.40±1.388 | 86.83±1.082 | 86.15±1.504 | 86.48±1.542 | 85.89±1.624 | 86.15±1.073 | 86.62±1.405
Fifteen scene(6) | 89.14±1.033 | 90.34±0.685 | 89.97±0.601 | 90.82±0.921 | 90.12±0.998 | 89.65±0.888 | 90.83±0.737
Fifteen scene(9) | 83.42±0.945 | 88.90±0.544 | 88.50±1.096 | 89.31±0.829 | 89.26±0.781 | 89.12±0.779 | 90.64±0.940
Fifteen scene(12) | 91.67±0.970 | 92.06±0.536 | 92.76±0.905 | 92.22±0.720 | 92.45±0.860 | 92.35±0.706 | 92.33±0.563
Fifteen scene(15) | 93.32±0.609 | 93.37±0.506 | 93.63±0.510 | 93.63±0.787 | 93.53±0.829 | 93.84±0.586 | 93.80±0.461
Fifteen scene(18) | 93.61±0.334 | 94.31±0.551 | 94.67±0.678 | 94.28±0.396 | 94.16±0.344 | 94.16±0.642 | 94.78±0.494
Average Time(18) | 0.0037s | 0.0759s | 0.9124s | 0.8119s | 0.8500s | 0.1811s | 0.0122s

TABLE I: Classification accuracies (mean classification accuracy ± standard deviation, %) and average computational time per test sample of different sparse representation algorithms with different numbers of training samples.

C. Discussion

Many sparse representation methods have become available in the past decades, and this paper introduces various sparse representation methods from several viewpoints, including their motivations, mathematical representations and the main algorithms. Based on the experimental results summarized in Section IX, we have the following observations.

First, choosing a suitable regularization parameter for sparse representation remains a challenging task that deserves further extensive study. We can see that the value of the regularization parameter can remarkably influence the performance of the sparse representation algorithms, and adjusting the parameters in sparse representation algorithms requires expensive labor. Moreover, adaptive parameter selection based sparse representation methods are preferable, yet very few methods have been proposed to solve this critical issue.

Second, although sparse representation algorithms have achieved distinctly promising performance on some real-world databases, many efforts should still be made to improve the accuracy of sparse representation based classification, and the robustness of sparse representation should be further enhanced. In terms of recognition accuracy, the l1_ls, homotopy and TPTSR algorithms achieve the best overall performance. Considering the experimental results of the seven algorithms on the five databases, the l1_ls algorithm has eight highest classification accuracies, followed by homotopy and TPTSR, in comparison with the other algorithms. One can also see that the sparse representation based classification methods still cannot obtain satisfactory results on some challenging databases. For example, all these representative algorithms achieve relatively inferior experimental results on the LFW dataset shown in Subsection IX-B, because the LFW dataset is designed for studying the problem of unconstrained face recognition [217] and most of its face images are captured under complex environments. One can see that the PALM algorithm has the worst classification accuracy on the LFW dataset, and its classification accuracy even decreases as the number of training samples increases. Thus, devising more robust sparse representation algorithms is an urgent issue.

Third, enough attention should be paid to the computational inefficiency of sparse representation with l1-norm minimization. One can see that high computational complexity is one of the major drawbacks of current sparse representation methods and also hampers their application in real-time processing scenarios. In terms of speed, PALM, FISTA and DALM take much longer to converge than the other methods, whereas OMP and TPTSR have the two lowest average computational times. Moreover, compared with the l1-regularized sparse representation based classification methods, TPTSR has very competitive classification accuracy but significantly lower complexity. Efficient and effective sparse representation methods are urgently needed by real-time applications. Thus, developing more efficient and effective methods is essential for future study on sparse representation.

Finally, the extensive experimental results have demonstrated that there is no absolute winner that achieves the best performance on all datasets in terms of classification accuracy and computational efficiency. However, the l1_ls, TPTSR and homotopy algorithms as a whole outperform the other algorithms. As a compromise, the OMP algorithm achieves distinct efficiency without sacrificing much recognition accuracy in comparison with the other algorithms, and it has also been extensively applied within more complex learning algorithms as a building block.

X. CONCLUSION

Sparse representation has been extensively studied in recent years. This paper summarizes and presents the various available sparse representation methods and discusses their motivations, mathematical representations and extensive applications. More specifically, we have analyzed their relations in theory and empirically introduced the applications, including dictionary learning based on sparse representation and real-world applications such as image processing, image classification and visual tracking.
Sparse representation has become a fundamental tool, which has been embedded into various learning systems and has received dramatic improvements and unprecedented achievements. Furthermore, dictionary learning is an extremely popular topic and is closely connected with sparse representation. Currently, efficient sparse representation, robust sparse representation, and dictionary learning based on sparse representation seem to be the main streams of research on sparse representation methods. The low-rank representation technique has also recently aroused intensive research interest, and sparse representation has been integrated into low-rank representation for constructing more reliable representation models. However, the mathematical justification of low-rank representation does not yet seem as elegant as that of sparse representation. Because employing the ideas of sparse representation as a prior can lead to state-of-the-art results, incorporating sparse representation with low-rank representation is worth further research. Moreover, subspace learning has become one of the most prevailing techniques in pattern recognition and computer vision. It is necessary to further study the relationship between sparse representation and subspace learning, and constructing more compact models for sparse subspace learning has become a popular topic in various research fields. The transfer learning technique has emerged as a new learning framework for classification, regression and clustering problems in data mining and machine learning. However, sparse representation research has still not been fully applied to the transfer learning framework, and it would be significant to unify the sparse representation and low-rank representation techniques within the transfer learning framework to solve domain adaptation, multitask learning, sample selection bias and covariate shift problems. Furthermore, research on deep learning seems to have become an overwhelming trend in the computer vision field. However, dramatically expensive training effort is the main limitation of current deep learning techniques, and how to fully introduce current sparse representation methods into the framework of deep learning remains valuable and unsolved.

The application scope of sparse representation has been widely extended to the machine learning and computer vision fields. Nevertheless, the effectiveness and efficiency of sparse representation methods cannot perfectly meet the needs of real-world applications. In particular, the complexity of sparse representation has greatly affected its applicability, especially to large-scale problems. Enhancing the robustness of sparse representation is considered another indispensable problem when researchers design algorithms. For image classification, robustness should be seriously considered, such as robustness to random corruption, varying illumination, outliers, occlusion and complex backgrounds. Thus, developing efficient and robust sparse representation methods is still the main challenge, and designing a more effective dictionary is expected to be beneficial to performance improvement.

Sparse representation still has wide potential for various possible applications, such as event detection, scene reconstruction, video tracking, object recognition, object pose estimation, medical image processing, genetic expression and natural language processing. For example, the study of sparse representation in visual tracking is an important direction, and more in-depth studies are essential for future improvements of visual tracking research.

In addition, most sparse representation and dictionary learning algorithms focus on employing the l0-norm or l1-norm regularization to obtain a sparse solution. However, there are still only a few studies on l2,1-norm regularization based sparse representation and dictionary learning algorithms. Moreover, other extended studies of sparse representation may be fruitful.

In summary, the recent prevalence of sparse representation has extensively influenced different fields. It is our hope that the review and analysis presented in this paper can help and motivate more researchers to propose better sparse representation methods.
ACKNOWLEDGMENT

This work was supported in part by the National Natural Science Foundation of China under Grant 61370163, Grant 61233011, and Grant 61332011, and in part by the Shenzhen Municipal Science and Technology Innovation Council under Grant JCYJ20130329151843309, Grant JCYJ20140417172417174, and Grant CXZZ20140904154910774, the China Postdoctoral Science Foundation funded project No. 2014M560264, and the Shaanxi Key Innovation Team of Science and Technology (Grant no. 2012KCT-04).

We would like to thank Jian Wu for many inspiring discussions; he is ultimately responsible for many of the ideas in the algorithm and analysis. We would also like to thank Dr. Zhihui Lai, Dr. Jinxing Liu and Xiaozhao Fang for constructive suggestions. Moreover, we thank the editor, an associate editor, and the referees for helpful comments and suggestions which greatly improved this paper.

REFERENCES

[1] B. K. Natarajan, "Sparse approximate solutions to linear systems," SIAM Journal on Computing, vol. 24, no. 2, pp. 227-234, 1995.
[2] M. Huang, W. Yang, J. Jiang, Y. Wu, Y. Zhang, W. Chen, and Q. Feng, "Brain extraction based on locally linear representation-based classification," NeuroImage, vol. 92, pp. 322-339, 2014.
[3] X. Lu and X. Li, "Group sparse reconstruction for image segmentation," Neurocomputing, vol. 136, pp. 41-48, 2014.
[4] M. Elad, M. Figueiredo, and Y. Ma, "On the role of sparse and redundant representations in image processing," Proceedings of the IEEE, vol. 98, no. 6, pp. 972-982, 2010.
[5] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, 2008.
[6] J. Starck, F. Murtagh, and J. M. Fadili, Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity. Cambridge University Press, 2010.

Zheng Zhang received the B.S. degree from Henan University of Science and Technology and the M.S. degree from Shenzhen Graduate School, Harbin Institute of Technology (HIT), in 2012 and 2014, respectively. He is currently pursuing the Ph.D. degree in computer science and technology at Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China. His current research interests include pattern recognition, machine learning and computer vision.

Yong Xu was born in Sichuan, China, in 1972. He received his B.S. and M.S. degrees at the Air Force Institute of Meteorology (China) in 1994 and 1997, respectively, and the Ph.D. degree in pattern recognition and intelligence systems at the Nanjing University of Science and Technology (NUST) in 2005. He now works at Shenzhen Graduate School, Harbin Institute of Technology. His current interests include pattern recognition, biometrics, machine learning and video analysis.

Jian Yang received the B.S. degree in mathematics from Xuzhou Normal University in 1995, the M.S. degree in applied mathematics from Changsha Railway University in 1998, and the Ph.D. degree from the Nanjing University of Science and Technology (NUST), on the subject of pattern recognition and intelligence systems, in 2002. In 2003, he was a postdoctoral researcher at the University of Zaragoza. From 2004 to 2006, he was a Postdoctoral Fellow at the Biometrics Centre of Hong Kong Polytechnic University. From 2006 to 2007, he was a Postdoctoral Fellow at the Department of Computer Science of the New Jersey Institute of Technology. He is now a professor in the School of Computer Science and Technology of NUST. He is the author of more than 80 scientific papers in pattern recognition and computer vision; his journal papers have been cited more than 1600 times in the ISI Web of Science and 2800 times in Google Scholar. His research interests include pattern recognition, computer vision and machine learning. He is currently an associate editor of Pattern Recognition Letters and of the IEEE Transactions on Neural Networks and Learning Systems.

Xuelong Li is a Full Professor with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, Shaanxi, P.R. China.

David Zhang graduated in Computer Science from Peking University. He received his M.Sc. in Computer Science in 1982 and his Ph.D. in 1985 from the Harbin Institute of Technology (HIT). From 1986 to 1988 he was a Postdoctoral Fellow at Tsinghua University and then an Associate Professor at the Academia Sinica, Beijing. In 1994 he received his second Ph.D. in Electrical and Computer Engineering from the University of Waterloo, Ontario, Canada. Currently, he is a Chair Professor at the Hong Kong Polytechnic University, where he is the Founding Director of the Biometrics Technology Centre (UGC/CRC) supported by the Hong Kong SAR Government in 1998. He also serves as Visiting Chair Professor at Tsinghua University, and Adjunct Professor at Shanghai Jiao Tong University, Peking University, Harbin Institute of Technology, and the University of Waterloo. He is the Founder and Editor-in-Chief of the International Journal of Image and Graphics (IJIG); Book Editor of the Springer International Series on Biometrics (KISB); Organizer of the first International Conference on Biometrics Authentication (ICBA); Associate Editor of more than ten international journals, including IEEE Transactions and Pattern Recognition; Technical Committee Chair of IEEE CIS; and the author of more than 10 books and 200 journal papers. Professor Zhang is a Croucher Senior Research Fellow, Distinguished Speaker of the IEEE Computer Society, and a Fellow of both IEEE and IAPR.