
A convex model for non-negative matrix factorization and dimensionality reduction on physical space

Ernie Esser, Michael Möller, Stanley Osher, Guillermo Sapiro, Jack Xin

Abstract—A collaborative convex framework for factoring a data matrix X into a non-negative product AS, with a sparse coefficient matrix S, is proposed. We restrict the columns of the dictionary matrix A to coincide with certain columns of the data matrix X, thereby guaranteeing a physically meaningful dictionary and dimensionality reduction. We use l1,∞ regularization to select the dictionary from the data and show this leads to an exact convex relaxation of l0 in the case of distinct noise free data. We also show how to relax the restriction-to-X constraint by initializing an alternating minimization approach with the solution of the convex model, obtaining a dictionary close to but not necessarily in X. We focus on applications of the proposed framework to hyperspectral endmember and abundances identification and also show an application to blind source separation of NMR data.

Index Terms—Non-negative matrix factorization, dictionary learning, subset selection, dimensionality reduction, hyperspectral endmember detection, blind source separation

Copyright © 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected]. EE and JX were supported by NSF grants DMS-0911277, DMS-0928427, MM by the German Academic Exchange Service (DAAD), SO and MM by NSF grants DMS-0835863, DMS-0914561, DMS-0914856 and ONR grant N00014-08-1119, and GS was supported by NSF, NGA, ONR, ARO, DARPA, and NSSEFF. E. Esser and J. Xin are with the Department of Mathematics, University of California Irvine, 340 Rowland Hall, Irvine, CA 92697-3875. M. Möller and S. Osher are with the Department of Mathematics, University of California Los Angeles, Box 951555, Los Angeles, CA 90095-1555. G. Sapiro is with the Department of Electrical and Computer Engineering, University of Minnesota, 200 Union Street SE, 4-174 Keller Hall, Minneapolis, MN 55455-0170.

I. INTRODUCTION

DIMENSIONALITY reduction has been widely studied in the signal processing and computational learning communities. One of the major drawbacks of virtually all popular approaches for dimensionality reduction is the lack of physical meaning in the reduced dimension space. This significantly reduces the applicability of such methods. In this work we present a framework for dimensionality reduction, based on matrix factorization and sparsity theory, that uses the data itself (or small variations from it) for the low dimensional representation, thereby guaranteeing the physical fidelity. We propose a new convex method to factor a non-negative data matrix X into a product AS, for which S is non-negative and sparse and the columns of A coincide with columns from the data matrix X.

The organization of this paper is as follows. In the remainder of the introduction, we further explain the problem, summarize our approach and discuss applications and related work. In Section II we present our proposed convex model for endmember (dictionary) computation that uses l1,∞ regularization to select as endmembers a sparse subset of columns of X, such that sparse non-negative linear combinations of them are capable of representing all other columns. Section III shows that in the case of distinct noise free data, l1,∞ regularization is an exact relaxation of the ideal row-0 norm (number of non-zero rows), and furthermore proves the stability of our method in the noisy case. Section IV presents numerical results for both synthetic and real hyperspectral data. In Section V we present an extension of our convex endmember detection model that is better able to handle outliers in the data. We discuss its numerical optimization, compare its performance to the basic model and also demonstrate its application to a blind source separation (BSS) problem based on NMR spectroscopy data.

A. Summary of the problem and geometric interpretation

The underlying general problem of representing X ≈ AS with A, S ≥ 0 is known as non-negative matrix factorization (NMF). Variational models for solving NMF problems are typically non-convex and are solved by estimating A and S alternatingly. Although variants of alternating minimization methods for NMF often produce good results in practice, they are not guaranteed to converge to a global minimum.

The problem can be greatly simplified by assuming a partial orthogonality condition on the matrix S as is done in [1], [2]. More precisely, the assumption is that for each row i of S there exists some column j such that S_{i,j} > 0 and S_{k,j} = 0 for k ≠ i. Under this assumption, NMF has a simple geometric interpretation. Not only should the columns of A appear in the data X up to scaling, but the remaining data should be expressible as non-negative linear combinations of these columns. Therefore the problem of finding A is to find columns in X, preferably as few as possible, that span a cone containing the rest of the data X. Figure 1 illustrates the geometry in three dimensions.

The problem we actually want to solve is more difficult than NMF in a couple of respects. One reason is the need to deal with noisy data. Whereas NMF by itself is already a difficult problem, the identification of the vectors becomes even more difficult if the data X contains noise and we need to find a low dimensional cone that contains most of the data (see lower right image in Figure 1). Notice that in the noisy
case, finding vectors such that all data is contained in the cone they span would lead to a drastic overestimation of the number of vectors. Arbitrarily small noise at a single data point could already lead to including this vector into the set of cone spanning vectors. Thus, the problem is ill-posed and regularization is needed to handle noise. In addition to small noise there could also be outliers in the data, namely columns of X that are not close to being well represented as a non-negative linear combination of other columns, but that we do not wish to include in A. Such outliers could arise from bad sensor measurements, non-physical artifacts or any features that for some reason we are not interested in including in our dictionary A. Another complication that requires additional modeling is that for the applications we consider, the matrix S should also be sparse, which means we want the data to be represented as sparse non-negative linear combinations of the columns of A.

Fig. 1: Geometric interpretation of the endmember detection problem. First row: two different viewpoints for a data set in three dimensions; second row: same data set with the vectors that can express any data point as a non-negative linear combination in red; third row left: cone spanned by the three vectors containing all data; third row right: illustration of the cone with Gaussian noise added to the data, in this case not all points lie inside the cone anymore.

B. Our proposed approach

We obtain the X = AS factorization by formulating the problem as finding a sparse non-negative T such that X ≈ XT and as many rows of T as possible are entirely zero. We want to encourage this so-called row sparsity of T in order to select as few data points as possible as dictionary atoms. We do this by using l1,∞ regularization. This type of regularization couples the elements in each row of T and is based on the recent ideas of collaborative sparsity (see for example [3] and the references therein). The particular l1,∞ regularization has been studied by Tropp in [4], however, without considering non-negativity constraints and also not in the setting of finding a T such that X ≈ XT for physically meaningful dimensionality reduction. A strong motivation for using l1,∞ regularization instead of other row sparsity regularizers like l1,2 is that in the case of distinct, noise-free data, the l1,∞ model is an exact relaxation of minimizing the number of non-zero rows in T such that X = XT. This exact relaxation is independent of the coherence of the columns of X. Without the non-negativity constraint, a low coherence is crucial as is shown in [4]. The general setting X ≈ XT was proposed by Lin et al. in [5] for low rank approximation with the nuclear norm. However, the nuclear norm does not lead to row sparsity, and thus there is no obvious way to extract dictionary atoms from the minimizer. Both the nuclear norm and l1,∞ approaches are addressing related but different problems.

Our main contribution is to apply the joint sparsity regularizer l1,∞ to the non-negative factorization setting X ≈ XT and thereby implicitly select certain columns of X for the description of all columns of X. We pose this as a convex optimization problem in T. For practical reasons, we will need to perform some preliminary data reduction, explained in Section II, before carrying out the convex minimization. The main idea, however, is to minimize over T ≥ 0 the l1,∞ norm of T plus some convex penalty on X − XT. In the simplest case, we penalize ||X − XT||_F^2. We also propose an advanced noise model to handle the case where X contains outliers. Both models also incorporate a weighted l1 penalty to encourage a sparser T so that from the few columns of X selected to represent the whole data, only a few are used per sample.

C. Applications and related work

Although we concentrate on hyperspectral imaging (HSI) and briefly discuss an application to blind source separation (BSS), our method is applicable in numerous areas, from biology to sensor networks.

For instance, one approach to text mining, that is the technique of extracting important information from large text data sets, is to reduce the number of relevant text documents by clustering them into content dependent categories using non-negative matrix factorization (see [6] for details). Our approach could potentially be used to not only cluster the large amount of documents by a similar factorization, but due to the dictionary being a part of the data, it would furthermore lead to a correspondence of every atom in the dictionary to a certain document. This correspondence might help a human analyzer judge the importance of each cluster. Therefore the physical meaning of the dictionary atoms would have an immediate advantage for the further analysis.

As another example, in [7] non-negative matrix factorization is applied to spectrograms of many different musical sounds in order to obtain a spectral basis for musical genre classification. Again the physical fidelity of our approach could be interesting since it would provide a correspondence of each spectral basis element to a certain musical sound.

From the numerous potential applications, we concentrate on two to illustrate the proposed framework. We describe next the challenges in these applications. This early description of these applications will help to further motivate the work.

1) Introduction to hyperspectral imaging (HSI): HSI sensors record up to several hundred different frequencies in the visible, near-infrared and infrared spectrum. This precise spectral information provides some insight on the material at each pixel in the image. Due to relatively low spatial resolution and the presence of multiple materials at a single location (e.g., tree canopies above ground or water and water sediments), many pixels in the image contain the mixed spectral signatures of multiple materials. The task of determining the abundances (presence quantity) of different materials in each pixel is called spectral unmixing. This is clearly an ill-posed problem that requires some assumptions and data models.

Unmixing requires a dictionary with the spectral signatures of the possible materials (often denoted as endmembers). Since these dictionaries can be difficult to obtain and might depend on the conditions under which they were recorded, it is sometimes desirable to automatically extract suitable endmembers from the image one wants to demix in a process called endmember detection. Many different techniques for endmember detection have been proposed including variational [8], [9], [10] and Bayesian [11], [12], [13] models as well as convex geometry-based methods [2], [14], [1]. Related although not yet applied to endmember detection are subset selection methods like the rank-revealing QR (RRQR) decomposition (e.g. [15], [16]) and greedy dictionary selection methods [17], [18]. The general principle behind rank-revealing QR for subset selection is to find a column permutation of the data matrix such that the first few columns are as well conditioned as possible [16]. In a sense, it looks for the most linearly independent columns of a matrix.¹ However, unlike our approach, QR methods do not take non-negativity constraints into account. The greedy dictionary selection methods [17], [18] are based on greedy approximations of combinatorial optimization problems for sparse dictionary selection. These methods come with theoretical guarantees given incoherence assumptions about the data matrix. Such assumptions do not hold for our data, but because we have a nonnegativity constraint not assumed there, we can show our model is theoretically sound without strong incoherence assumptions on the data, as demonstrated by the exact convex relaxation discussed in Section III.

¹ Independent of the work here described, Laura Balzano, Rob Nowak and Waheed Bajwa developed a related matrix factorization technique and connected and compared it to RRQR. We thank Laura for pointing out their work [19] and also the possible relationships with RRQR.

Simultaneously detecting endmembers and computing abundances can be stated as factoring the data matrix X ∈ R^{m×d} into X ≈ AS, A, S ≥ 0, with both A ∈ R^{m×n} and S ∈ R^{n×d} being unknown. In this notation each column of X is the spectral signature of one pixel in the image. Hence, m is the number of spectral bands, d is the total number of pixels, and each of the n columns of A represents one endmember. The abundance matrix S contains the amounts of each material in A at each pixel in X. The application of NMF to hyperspectral endmember detection can for instance be found in [8], [9].

Considering that while material mixtures in HSI exist, it is unlikely that pixels contain a mixture of all or many of the materials in A, researchers have recently focused on encouraging sparsity on the abundance matrix S [20], [21]. Motivated by the ideas of dictionary learning for sparse coding, the works [22], [23] proposed explicitly to look for endmember matrices that lead to sparse abundance maps. We follow a similar idea, though our method will be fundamentally different in two aspects: First, we restrict columns of our dictionary A to appear somewhere in the data X. This is a common working hypothesis for moderate ground sampling distance images and is called the pixel purity assumption. It corresponds to the partial orthogonality assumption on S discussed previously. In the general context of dictionary learning and non-negative matrix factorization it guarantees the columns of A to be physically meaningful. As mentioned above, the lack of physical interpretation has been a critical shortcoming of standard dimensionality reduction and dictionary learning techniques, and not yet addressed in these areas of research. Second, choosing the dictionary columns from the data enables us to propose a convex model and hence avoid the problem of saddle points or local minima.

Different areas of applications use different terminology for mathematically similar things. Throughout this paper we will use the terminology of hyperspectral unmixing to explain and motivate our model, although its application is not exclusively in the field of HSI. For instance, an endmember could be any abstract column of the dictionary matrix A with a different physical meaning depending on its context. However, we think it aids the clarity and understanding of the model to use the HSI example throughout the explanations. To show that our model is not limited to the hyperspectral case we also present results for the problem of blind source separation, which we will briefly describe in the next subsection.

2) Introduction to blind source separation (BSS): Blind source separation is the general problem of recovering unknown source signals from measurements of their mixtures, where the mixing process is unknown but often assumed to be linear. Examples include demixing audio signals and nuclear magnetic resonance (NMR) spectroscopy. Many BSS problems have the same linear mixing model X = AS that we are using for hyperspectral endmember detection and unmixing, often also with a nonnegativity constraint on S [1]. Here the rows of S represent the source signals, A is the unknown mixing matrix and the rows of X are the measured signals. The analogy of the hyperspectral pixel purity assumption can also be applied to some BSS problems [1]. The interpretation is that for each source there is some place where among all the sources only that source is nonzero. Thus up to scaling, the columns of A should appear somewhere in the columns of the data matrix X and we can use the same algorithm we use for endmember detection.
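As a minimal illustration of the linear mixing model and the pixel purity assumption just described (the sizes and random numbers below are our own toy choices, not the paper's experiments), note that when S contains one pure column per material, the columns of A literally appear among the columns of X up to scaling:

```python
import numpy as np

# Toy illustration of X = AS with A, S >= 0 and the pixel purity assumption.
rng = np.random.default_rng(0)
m, n, d = 50, 4, 200                 # bands, endmembers, pixels (illustrative)
A = rng.random((m, n))               # non-negative endmember signatures
S = rng.random((n, d))               # non-negative abundances
S[:, :n] = np.eye(n)                 # pixel purity: one pure pixel per material
X = A @ S                            # observed mixed data

# The pure pixels reproduce the endmembers exactly, so a dictionary restricted
# to columns of X can still represent all of X by non-negative combinations.
assert np.allclose(X[:, :n], A)
```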

II. CONVEX ENDMEMBER DETECTION MODEL

A. Convexification of matrix factorization using the pixel purity assumption

As mentioned above, we are assuming at first that the endmembers can be found somewhere in the data X. This assumption will enable us to propose a convex model for factoring the data X into a product AS, a problem usually tackled by non-convex optimization techniques. We assume that there exists an index set I such that the columns X_i of X are endmembers for i ∈ I. Under the assumption of non-negative linear mixing of signals, this means that any column X_j in X can be written as

X_j = Σ_{i∈I} X_i T_{i,j},   (1)

for coefficients T_{i,j} ≥ 0. The problem is that the coefficients T_{i,j} as well as the index set I are unknown. Hence, we start by using all columns in X to describe X itself, i.e. we look for coefficients T ≥ 0 for which

X = XT.   (2)

Notice that Equation (2) has many solutions. However, we know that the desired representation uses as few columns of X as possible, i.e., only the endmembers. Since not using the jth column in the above formulation corresponds to having the entire jth row of T be zero, we can reformulate the endmember detection problem as finding a solution to (2) such that as many rows of T as possible are zero. Mathematically,

min_{T≥0} ||T||_{row-0} such that XT = X,   (3)

where ||T||_{row-0} denotes the number of non-zero rows. The columns of X that correspond to non-zero rows of the minimizer T of (3) are the desired endmembers (which define the lowest dimensional subspace where the data resides). Since problem (3) is not convex we relax the above formulation by replacing ||T||_{row-0} by the convex l1,∞ norm ||T||_{1,∞} = Σ_i max_j |T_{i,j}|. The l1 part of the norm should encourage sparsity. Combined with the maximum norm we can expect that the vector of maxima among each row becomes sparse, which means we obtain row sparsity. Notice that we could have chosen other row-sparsity encouraging regularizers such as l1,2. However, we will prove in Section III that the l1,∞ relaxation is exact in the case of normalized, non-repeating data, which makes it clearly preferable.

As mentioned in the previous section, it is important to take noise into account and therefore the equality (2) is too restrictive for real data. Hence, we will introduce a parameter that negotiates between having many zero rows (also called row sparsity) and good approximation of the data ||X − XT||_F^2 in the Frobenius norm. Furthermore, for T to be a good solution, not only should most of its rows be zero, but the nonzero rows should also be sparse, since sparsity of the coefficient matrix reflects physically reasonable prior knowledge. For hyperspectral unmixing the additional sparsity requirement on T reflects the assumption that most pixels are not mixtures of all the selected endmembers, but rather just a few. Thus, we will add an additional weighted l1 term to incorporate this second type of sparsity (see [24] for a model combining structured and collaborative sparsity with individual sparsity).

B. Data reduction as preprocessing

It is already clear from Equation (2) that the problem is too large because the unknown matrix T is a d × d matrix, where d is the number of pixels. Thus, before proceeding with the proposed convex model, we reduce the size of the problem by using clustering to choose a subset of candidate endmembers Y from the columns of X and a submatrix Xs ∈ R^{m×ds} of X for the data with ds ≤ d. In other words, we want to reformulate the problem as YT ≈ Xs, where Xs ∈ R^{m×ds}, Y ∈ R^{m×nc}, T ∈ R^{nc×ds}, nc << d and ds ≤ d. We use Xs = Y in our experiments but could also include more or even all of the data. We use k-means with a farthest-first initialization to select Y. Other data reduction procedures could be used. An angle constraint ⟨Yi, Yj⟩ < .995 ensures the endmember candidates, namely the columns of Y, are sufficiently distinct. An acceptable angle constraint could probably be estimated from assumptions about the noise and the mixing, based on the principle that slightly perturbed versions of the same endmember shouldn't be in Y. However, we found it much easier to simply treat the angle constraint as a parameter, and .995 worked well for all the HSI experiments. An upper bound is placed on the number of allowable clusters so that the size of the problem is reasonable. We then propose a convex model for the more manageable problem of finding a nonnegative T such that YT ≈ Xs, with T having the same sparsity properties described above. Note that we have not convexified the problem by pre-fixing the dictionary Y. This is done simply to work with manageable dimensions and datasets. Our convex model will still select the endmembers as a subset of this reduced dataset Y, namely the columns of Y that will correspond to non-zero rows of T.

C. The endmember selection model

Our model consists of a data fidelity term and two terms that encourage the desired sparsity in T. For simplicity, we consider the data fidelity term

(β/2) ||(YT − Xs) Cw||_F^2,   (4)

where ||·||_F denotes the Frobenius norm, β is a positive constant, and Cw ∈ R^{ds×ds} is a diagonal matrix we can use to weight the columns of (YT − Xs) so that it reflects the density of the original data. As mentioned earlier we encourage rows of T to be zero by penalizing with ζ||T||_{1,∞}, which for non-negative T equals ζ Σ_i max_j(T_{i,j}), with ζ a positive constant, so that only a few samples are cooperatively selected as endmembers.² This kind of collaborative/structured sparsity regularizer has been proposed in several previous works, for example [4], [25].

² The work mentioned above by Balzano, Nowak and Bajwa uses block orthogonal matching pursuit and also mentions the possible use of ||·||_2 instead of ||·||_∞.
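To make the data reduction of Section II-B concrete, the following is a minimal sketch of one way it could be implemented; the function name, the plain k-means loop and the default parameter values are our own illustration rather than the authors' code, which also fixes an upper bound on the number of clusters:

```python
import numpy as np

def reduce_data(X, n_clusters=150, n_iter=10, cos_max=0.995):
    """Choose candidate endmembers Y from the columns of X (Section II-B):
    k-means on normalized columns with farthest-first initialization, then an
    angle filter <Y_i, Y_j> < cos_max. Also returns per-cluster pixel fractions
    that can serve as the diagonal column weights C_w."""
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)       # unit l2 columns
    m, d = Xn.shape

    # farthest-first initialization
    idx = [int(np.random.randint(d))]
    for _ in range(n_clusters - 1):
        dist = np.min(1.0 - Xn[:, idx].T @ Xn, axis=0)      # distance to chosen set
        idx.append(int(np.argmax(dist)))
    centers = Xn[:, idx].copy()

    # a few standard k-means iterations on the normalized columns
    for _ in range(n_iter):
        labels = np.argmax(centers.T @ Xn, axis=0)          # nearest center by inner product
        for k in range(n_clusters):
            members = Xn[:, labels == k]
            if members.shape[1] > 0:
                c = members.mean(axis=1)
                centers[:, k] = c / np.linalg.norm(c)

    # keep only sufficiently distinct candidates
    keep = []
    for k in range(n_clusters):
        if all(centers[:, k] @ centers[:, j] < cos_max for j in keep):
            keep.append(k)
    Y = centers[:, keep]
    weights = np.array([(labels == k).sum() for k in keep], float) / d
    return Y, weights
```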

To encourage sparsity of the nonnegative T, we use a weighted l1 norm, ⟨R_w σ C_w, T⟩. Here R_w is a diagonal matrix of row weights. We choose R_w to be the identity in our experiments, but if for example we wanted to encourage selection of endmembers towards the outside of the cone containing the data, we could also choose these weights to be

D. Refinement of solution

Since we are using a convex model to detect endmembers, it cannot distinguish between identical or very similar endmember candidates, which we will discuss from a more theoretical point of view in Section III. However, the model works very reliably when the columns of Y are sufficiently distinct, which they are by construction. A limitation is that the convex model is unable to choose as endmembers any data not represented in Y. Nonetheless, as is shown in Section IV, the results of this approach already compare favorably to other methods. Even if some endmembers are not in Y, which could be due either to the data reduction preprocessing or to a failure of the pixel purity assumption, the underlying problem of representing the data in terms of the available endmember candidates is still well posed. The convex model clearly cannot recover endmembers that are not present in Y, but it should still recover those that are present and represent the data as sparse nonnegative linear combinations of a small number of columns of Y. Moreover, it provides an excellent initialization for the alternating minimization approach to NMF, which can be used to further refine the solution if desired. Letting Ã be the endmembers selected by the convex model, the solution is refined by alternately minimizing

min_{A≥0, S≥0, ||A−Ã|| ≤ ·}  (1/2) ||AS − X||_F^2 + ⟨R_w σ, S⟩.   (7)

Note that it is faster to precompute and store the inverse involved in the update for Z, but this can be overly memory intensive if Y has many columns and C_w ≠ I. One could use a more explicit variant of ADMM such as the method in [28] to avoid this difficulty if a larger number of columns of Y is desired. Here we have restricted this number n_c to be less than 150 and ADMM can be reasonably applied.

Also note that the minimization problem for T^{k+1},

T^{k+1} = argmin_T  g_{≥0}(T) + ζ Σ_i max_j(T_{i,j}) + (δ/2) ||T − Z^{k+1} − P^k/δ + R_w σ C_w/δ||_F^2,   (8)

decouples into separate problems for each row T_i. The Legendre transformation of g_{≥0}(T_i) + ζ max_j(T_{i,j}) is the indicator function for the set C_ζ = {P_i ∈ R^{ds} : ||max(P_i, 0)||_1 ≤ ζ}. Let T̃^{k+1} = Z^{k+1} + P^k/δ − R_w σ C_w/δ. Then by the Moreau decomposition [29], the minimizer T^{k+1} is given by

T^{k+1} = T̃^{k+1} − Π_{C_{ζ/δ}}(T̃^{k+1}),

where Π_{C_{ζ/δ}}(T̃^{k+1}) orthogonally projects each row of T̃^{k+1} onto C_{ζ/δ}, which is something that can be computed with complexity O(ds log(ds)).
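For illustration, the row-wise projection onto C_ζ that the T-update requires can be computed by soft-thresholding only the positive entries, with the threshold found as in standard l1-ball projections. The sketch below and its helper names are ours, not the authors' code:

```python
import numpy as np

def project_row_onto_C(v, zeta):
    """Project a row v onto C_zeta = {p : ||max(p, 0)||_1 <= zeta}.

    Negative entries are left untouched; positive entries are soft-thresholded,
    with the threshold found by sorting (O(d log d))."""
    pos = np.maximum(v, 0.0)
    if pos.sum() <= zeta:
        return v.copy()
    u = np.sort(pos[pos > 0])[::-1]                       # positive entries, descending
    css = np.cumsum(u)
    k = np.nonzero(u - (css - zeta) / np.arange(1, len(u) + 1) > 0)[0][-1]
    mu = (css[k] - zeta) / (k + 1.0)                      # threshold achieving the budget
    return np.where(v > 0, np.maximum(v - mu, 0.0), v)

def update_T(T_tilde, zeta, delta):
    """T-update of Eq. (8) via the Moreau decomposition,
    T^{k+1} = T_tilde - Pi_{C_{zeta/delta}}(T_tilde), applied row by row."""
    return np.vstack([row - project_row_onto_C(row, zeta / delta) for row in T_tilde])
```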

The other algorithm parameters we use are δ = 1, ζ = 1, β = 250, ν = 50, and h = 1 − cos(4π/180). In our experiments, we also choose Xs = Y. We then define column weights Cw that weight each column j by the number of pixels in the jth cluster (the cluster centered at Yj) divided by the total number of pixels d. To refine the solution of the convex model, we note that each alternating step in the minimization of (7) is a convex minimization problem that can again be straightforwardly minimized using ADMM and its variants. The update for the abundance S is identical to the split Bregman algorithm proposed for hyperspectral demixing in [20], [21], and its connection to ADMM is discussed in [30].

III. THE CONNECTION BETWEEN ROW-0 AND l1,∞

In this section we will show that the motivation for our model comes from a close relation between the convex l1,∞ norm and the row-0 norm, which counts the number of non-zero rows.

A. Distinct noise-free data

Let us assume we have data X which is completely noise free, normalized, obeys the linear mixing model, and contains a pure pixel for each material. Under slight abuse of notation let us call this data after removing points that occur more than once in the image, X again. As discussed in Section II the endmember detection problem can now be reformulated as finding the minimizer T of (3), where the true set of endmembers can then be recovered as the columns of X that correspond to non-zero rows of T.

The fact that problem (3) gives a solution to the endmember selection problem is not surprising, since (3) is a non-convex problem and related problems (for instance in compressed sensing) are usually hard to solve. However, due to the non-negativity constraint we will show that l1,∞ minimization is an exact relaxation of the above problem.

For any T ≥ 0 with XT = X the entries of T are less than or equal to one, T_{i,j} ≤ 1.³ Furthermore, the endmembers can only be represented by themselves, which means that in each row with index i, i ∈ I, we have a coefficient equal to 1. We can conclude that the l1,∞ norm of any T ≥ 0 with XT = X is

||T||_{1,∞} = Σ_{i=1}^{d} max_j T_{i,j}   (9)
            ≥ Σ_{i∈I} max_j T_{i,j} = |I|.   (10)

However, it is possible to have equality in the above estimate if and only if T_{i,j} = 0 for i ∉ I. In this case the rows of the non-negative l1,∞ minimizer of XT = X are only supported on I, which means it is a minimizer to the row-0 problem (3). Vice versa, any row-0 minimizer T̂ has exactly one entry equal to one in any row corresponding to an endmember and zero rows elsewhere, which means ||T̂||_{1,∞} = |I|, and hence T̂ also minimizes the l1,∞ norm under the XT = X constraint. We therefore have shown the following lemma:

Lemma III.1. If we remove repeated columns of X and have normalized data, the sets of minimizers of

min_{T≥0} ||T||_{row-0} such that XT = X   (11)

and

min_{T≥0} ||T||_{1,∞} such that XT = X   (12)

are the same.

³ This is a simple fact based on the normalization and non-negativity: 1 = ||X_k|| = ||Σ_i T_{i,k} X_i|| ≥ max_i ||T_{i,k} X_i|| = max_i T_{i,k}.

Notice that while generally there are other regularizations, like for instance l1,2, which would also encourage row sparsity, this lemma only holds for l1,∞. For XT = X, the l1,∞ norm counts the number of rows and is not influenced by any value that corresponds to a mixed pixel. This property is unique for l1,∞ regularization with a non-negativity constraint and therefore makes it the preferable choice in the proposed framework.

B. Noise in the data

Of course the assumptions above are much too restrictive for real data. Therefore, let us look at the case of noisy data of the form X^δ = X + N, where X is the true, noise-free data with no repetition of endmembers as in the previous section, and N is noise bounded in the Frobenius norm by ||N||_F ≤ δ. We now consider the model

J_α(T) = ||XT − X||_F^2 + α ||T||_{1,∞} such that T ≥ 0.   (13)

The following lemma shows that for the right, noise-dependent choice of regularization parameter α, we converge to the correct solution as the noise decreases.

Lemma III.2. Let T̂ be a non-negative ||·||_{1,∞}-minimum norm solution of XT = X, let X^δ = X + N be noisy data with ||N||_F ≤ δ and let T^δ_α denote a minimizer of the energy functional (13) with regularization parameter α and with X replaced by X^δ. If α is chosen such that

α → 0,  δ²/α → 0  as δ → 0,   (14)

then there exists a convergent subsequence T^{δ_n}_{α_n} and the limit of each convergent subsequence is a ||·||_{1,∞}-minimum norm solution of XT = X.

For the sake of clarity we moved the proof of this lemma to the appendix. Lemma III.2 shows that our regularization is stable, since for decreasing noise and appropriate choice of the parameter α we converge to a non-negative ||·||_{1,∞}-minimum norm solution of XT = X, which - as we know from the first part - gives the true solution to the endmember detection problem when the columns of X are distinct. While identical points are easy to identify and eliminate in the noise free data, determining which points belong to the same endmember in the noisy data can be very difficult. This is the reason for our first data reduction step. Lemma III.2 tells us that for our method to be closely related to the row-0 approach we have to avoid having several noisy versions of the same endmember in X. We therefore impose an angle constraint as described in Section II-B while clustering the data, which basically corresponds to an upper bound on the noise and states up to which angle signals might correspond to the same point.

IV. NUMERICAL RESULTS FOR HSI

In this section we present numerical results on endmember detection for supervised and real hyperspectral data and compare our results to existing detection algorithms. Since the goal of the paper is to present a general new convex framework for matrix factorization which is applicable to multiple areas, the hyperspectral numerical results are intended mainly as a proof of concept. It is therefore encouraging that our method is competitive with some of the established methods it is compared to in the examples below.

A. Application to blind hyperspectral unmixing

1) Supervised endmember detection: For comparison purposes we extracted nine endmembers from the standard Indian pines dataset (publicly available at https://engineering.purdue.edu/biehl/MultiSpec/hyperspectral.html) by averaging over the corresponding signals in the ground truth region. Then we created 50 data points for each endmember, 30 data points for each combination of two different endmembers, 10 data points for each combination of three different endmembers, and additionally 30 data points as mixtures of all endmembers. Finally, we add Gaussian noise with zero mean and standard deviation 0.006, make sure our data is positive, and normalize it. We evaluate our method in a comparison to N-findr [2], vertex component analysis (VCA) [14], [31] with code from [32], an NMF method using the alternating minimization scheme of our refinement step with random initial conditions, and the QR algorithm. For the latter we simply used MATLAB's QR algorithm to calculate a permutation matrix Π such that XΠ = QR with decreasing diagonal values in R and chose the first nine columns of XΠ as endmembers. Many other methods could also be applied here, including the additional variational, Bayesian and greedy dictionary selection methods referenced in the introduction. However, comparisons to the simple, well established N-findr, VCA, QR and alternating minimization algorithms suffice to demonstrate that our proposed method works. Since the success of non-convex methods depends on the initial conditions or on random projections, we run 15 tests with the same general settings and record the average, maximum and minimum angle by which the reconstructed endmember vectors deviate from the true ones, see Table I.

Our method, unlike the others in Table I, does not require the number of endmembers to be specified in advance. For the tests we adjusted the parameters of our method to obtain 9 endmembers. Figure 2 shows the original endmembers as well as example reconstructions by each method with the corresponding angle of deviation.

We can see that our method gives the best average performance. Due to a high noise level, methods that rely on finding the cone with maximal volume or finding the most linearly independent vectors will select outliers as the endmembers and do not yield robust results. Looking at the minimal and maximal angles we see the effect predicted. The non-convex methods like alternating minimization and VCA can outperform our method on some examples, giving angles as low as 1.76. However, due to the non-convexity they can sometimes find results which are far off the true solution, with deviation angles of 6.95 or even 8.17 degrees. A big benefit of our convex framework is that we consistently produce good results. The difference between the best and the worst reconstruction angle deviation for our method is 0.15 degrees with and 0.17 without the refinement, which underlines its high stability.

TABLE I: Deviation angle from true endmembers (evaluation on 15 test runs)

Method                     Avg. angle   Min. angle   Max. angle
Ours refined                  3.37         3.30         3.42
Ours without refinement       3.93         3.84         4.01
VCA                           4.76         1.78         6.95
N-findr                      10.19         7.12        13.79
QR                            9.87         4.71        12.74
Alt. Min.                     4.50         1.76         8.17

The computational complexity of our method is highly dependent on the amount of data reduction. There is a trade-off here since more data reduction speeds up the computation but limits the number of endmember candidates. For instance, in the above experiment our algorithm (including the clustering with 10 k-means iterations and the refinement step) takes about 100 seconds. This run-time comes from the clustering selecting 250 endmember candidates due to a rather large noise level, which can make many similar signals look much more distinct than they really are. Reducing the noise level by a factor of 1/2, the clustering only selects between 90 and 100 endmember candidates, leading to a much faster run-time of about 13 seconds. In comparison, the alternating minimization approach needs about 45 seconds in the case of strong noise and about 29 seconds in the case of moderate noise. In our approach, the upper bound on the number of columns of Y can be adjusted according to how much of an issue computational resources are.

2) Results on real hyperspectral data: To show how our method performs on real hyperspectral data we use the urban image (publicly available at www.tec.army.mil/hypercube). Figure 3 shows the RGB image of the urban scene, the spectral signatures of the endmembers our method extracted, and the abundance maps of each endmember, i.e., each row of T written back into an image.

The abundance maps are used in HSI to analyze the distribution of materials in the scene. First of all our method managed to select six endmembers from the image, which is a very reasonable dimension reduction for hyperspectral image analysis. We can see that the abundance maps basically segment the image into clusters of different material categories
such as concrete, house roofs, soil or dirt, grass, and two different types of vegetation, which all seem to be reasonable when visually comparing our results to the RGB image. The spectral signatures our method selected are relatively distinct, but do not look noisy. Furthermore, the abundance maps seem to be sparse, which was a goal of our method and reflects the physically reasonable assumption that only very few materials should be present at each pixel.

Fig. 2: Comparison of endmember reconstruction methods.

As a next step we compare our results to the ones obtained by N-findr, VCA, QR and alternating minimization. Unfortunately, we have no ground truth for the urban image, which is why we will look at the non-negative least squares (NNLS) unmixing results based on the endmembers A_m each method found. Notice that geometrically NNLS gives the projection of the data onto the cone spanned by the endmembers. If S_m denotes the NNLS solution of each method, the error ||A_m S_m − X||_F^2 gives some insight on how much data is contained in the cone and therefore, how well the endmembers describe the data. However, as discussed earlier, due to noise we might not be interested in detecting the pixels furthest outside to be endmembers, although they might describe the data better. Thus, we will also report the sparsity of each cone projection S_m. Since any outside point will be projected onto a face or an edge of the cone, the sparsity will give some insight into whether the endmember vectors are well located. The more the endmembers are towards the inside of a point cluster, the more points we expect to be projected onto an edge of the cone rather than a face. Thus, a high sparsity relates to a reasonable position of an endmember. Table II shows the relative number of non-zero coefficients, i.e., ||S_m||_0 divided by the total number of entries in S_m, as well as the projection error for N-findr, QR, VCA, alternating minimization and our method. Figure 4 shows the corresponding endmember signals each method found.

TABLE II: Comparison of different endmember detection methods in terms of error and sparsity of the data projection onto the cone spanned by the endmembers on the urban image. The corresponding endmember signatures are shown in Figure 4.

                          Ours     N-findr    VCA      QR       Alt. Min.
||A_m S_m − X||_F^2       533.9    4185.8     2516.5   1857.6   454.3
||S_m||_0 / (d · n)       0.40     0.60       0.47     0.60     0.41

We can see from the projection error that the sparse alternating minimization approach and our method found a much better representation for the data than N-findr, VCA and QR, with the alternating minimization performing slightly better than our approach. Furthermore, the sparsity of the projection is also higher, which indicates more reasonable choices for the endmembers. Looking at the spectral signatures in Figure 4 we can speculate why this is the case. QR as well as N-findr chose very distinct signatures, which are probably far outliers in the dataset. Some of the VCA endmembers look even more extreme and take negative values, which is clearly unphysical and does not allow these endmembers to be interpreted as any material. On the other hand the endmembers of the alternating minimization approach and of our method are very similar and look comparably smooth. The average angle of deviation between our method and the alternating minimization approach is only 3.4 degrees, which lets us conclude that they basically converged to the same answer, which is encouraging considering the fact that these endmembers describe the rest of the data more than three times more accurately than the endmembers found by other methods. Furthermore, any signal our method selected differs at most by 0.036 degrees from an actual point in the data and is therefore physically meaningful.
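As a sketch of how the two Table II quantities could be computed with off-the-shelf non-negative least squares (the nonzero threshold and the helper name are our assumptions; the paper does not specify how ||S_m||_0 is counted numerically):

```python
import numpy as np
from scipy.optimize import nnls

def cone_projection_stats(A_m, X, tol=1e-6):
    """Project each column of X onto the cone spanned by the endmembers A_m
    via NNLS and report the projection error and relative sparsity of S_m."""
    m, d = X.shape
    n = A_m.shape[1]
    S_m = np.zeros((n, d))
    for j in range(d):
        S_m[:, j], _ = nnls(A_m, X[:, j])                 # abundances of pixel j
    err = np.linalg.norm(A_m @ S_m - X, 'fro') ** 2       # ||A_m S_m - X||_F^2
    sparsity = np.count_nonzero(S_m > tol) / (d * n)      # ||S_m||_0 / (d * n)
    return err, sparsity, S_m
```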

Fig. 3: Results on the urban hyperspectral image. First row, left: RGB image to compare the fraction planes to. First row, middle left: spectral signatures of the endmembers our method found. First row middle right, right and second row: abundance maps of the six endmembers.
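The abundance maps in Figure 3 are simply the rows of T written back into images. A minimal sketch (the row-major reshape convention is an assumption about how the hyperspectral cube was flattened):

```python
import numpy as np

def abundance_maps(T, height, width):
    """Turn each row of the coefficient matrix T (one row per selected
    endmember, one column per pixel) back into an image, as in Figure 3."""
    n_endmembers, n_pixels = T.shape
    assert n_pixels == height * width
    return [T[i].reshape(height, width) for i in range(n_endmembers)]
```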

V. ANEXTENDEDMODEL We also propose an extended version of the model that takes into account the normalization of the data in order to better distinguish between noise and outliers. We show this slightly more complicated functional can still be efficiently solved using convex optimization.

A. Error model

We continue to work with the reduced set of endmember candidates Y and a subset of the data Xs, which in practice we take to be Y. Instead of penalizing ||(YT − Xs)Cw||_F^2, we impose the linear constraint YT − Xs = V − Xs diag(e), where T, V and e are the unknowns. T has the same interpretation as before, V ∈ R^{m×ds} models the noise, and e ∈ R^{ds} models the sparse outliers. Since the columns of Xs are normalized to have unit l2 norm, we would also like most of the columns of YT to be approximately normalized. In modeling the noise V, we therefore restrict it from having large components in the direction of Xs. To approximate this with a convex constraint V ∈ D, we restrict each column Vj to lie in a hockey puck shaped disk Dj. Decompose Vj = Vj^⊥ + Vj^∥, where Vj^∥ is the orthogonal projection of Vj onto the line spanned by Xj and Vj^⊥ is the radial component of Vj perpendicular to Vj^∥. Then given 0 ≤ rj < 1, we restrict ||Vj^⊥||_2 ≤ rj and sqrt(1 − rj^2) − 1 ≤ Vj^∥ ≤ 0. The orthogonal projection onto this set is straightforward to compute since it is a box constraint in cylindrical coordinates. This constraint set for Vj is shown in Figure 5 in the case when ej = 0.

Fig. 5: Region of possible values for Xj + Vj

We also allow for a few columns of the data to be outliers. These are columns of X that we don't expect to be well represented as a small error plus a sparse nonnegative linear combination of other data, but that we also don't want to consider as endmembers. Given some γ ≥ 0, this sparse error is modeled as −Xs diag(e) with e restricted to the convex set E = {e : e ≥ 0 and Σ_j (Cw e)_j ≤ γ}. Since E is the nonnegative region of a weighted l1 ball, the orthogonal projection onto E can be computed with O(ds log(ds)) complexity. Here, since the weights w sum to one by definition, γ can be roughly interpreted as the fraction of data we expect to be outliers. For non-outlier data Xj, we want ej ≈ 0 and for outlier data we want ej ≈ 1. In the latter outlier case, regularization on the matrix T should encourage the corresponding column Tj to be close to zero, so ||YTj||_2 is encouraged to be small rather than close to one.

Keeping the l1,∞ regularization, the nonnegativity constraint and the weighted l1 penalty from Equation (6), the overall extended model is given by

min_{T≥0, Vj∈Dj, e∈E}  ζ Σ_i max_j(T_{i,j}) + ⟨R_w σ C_w, T⟩   (15)
such that YT − Xs = V − Xs diag(e).
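To make the V ∈ D constraint concrete, here is a sketch (our own illustration, with hypothetical helper names) of the cylindrical-coordinate projection of one column onto the "hockey puck" Dj described above:

```python
import numpy as np

def project_onto_puck(v, x, r):
    """Project v onto D = {V : ||V_perp||_2 <= r, sqrt(1-r^2)-1 <= V_par <= 0},
    where V_par is the signed component of V along the unit-norm vector x and
    V_perp is the remainder; a box constraint in cylindrical coordinates."""
    par = float(x @ v)                                     # component along x
    perp = v - par * x                                     # perpendicular component
    par = np.clip(par, np.sqrt(1.0 - r**2) - 1.0, 0.0)     # clip the puck's height
    norm_perp = np.linalg.norm(perp)
    if norm_perp > r:                                      # clip the puck's radius
        perp *= r / norm_perp
    return par * x + perp
```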

Fig. 4: Spectral signatures of endmembers extracted by different methods. Top row: results of our method and the alternating minimization approach. Bottom row: endmembers found by N-findr, QR and VCA

The structure of this model is similar to the robust PCA model proposed in [33] even though it has a different noise model and uses l1,∞ regularization instead of the nuclear norm.

B. Numerical optimization

Since the convex functional for the extended model (15) is slightly more complicated, it is convenient to use a variant of ADMM that allows the functional to be split into more than two parts. The method proposed by He, Tao and Yuan in [34] is appropriate for this application. Again introduce a new variable Z and the constraint Z = T. Also let P1 and P2 be Lagrange multipliers for the constraints Z − T = 0 and YZ − V − Xs + Xs diag(e) = 0 respectively. Then the augmented Lagrangian is given by

L_δ(Z, T, V, e, P1, P2) = g_{≥0}(T) + g_D(V) + g_E(e) + ζ Σ_i max_j(T_{i,j}) + ⟨R_w σ C_w, T⟩
    + ⟨P1, Z − T⟩ + ⟨P2, YZ − V − Xs + Xs diag(e)⟩
    + (δ/2) ||Z − T||_F^2 + (δ/2) ||YZ − V − Xs + Xs diag(e)||_F^2,

where g_D and g_E are indicator functions for the V ∈ D and e ∈ E constraints.

Using the ADMM-like method in [34], a saddle point of the augmented Lagrangian can be found by iteratively solving the following subproblems with parameters δ > 0 and µ > 2,

Z^{k+1} = argmin_Z (δ/2) || [I; Y] Z − [T^k − P1^k/δ ; V^k − Xs diag(e^k) + Xs − P2^k/δ] ||_F^2

T^{k+1} = argmin_T g_{≥0}(T) + ζ Σ_i max_j(T_{i,j}) + (δµ/2) || T − T^k − (1/µ)(Z^{k+1} − T^k) − P1^k/(δµ) + R_w σ C_w/(δµ) ||_F^2

V^{k+1} = argmin_V g_D(V) + (δµ/2) || V − V^k − (1/µ)(YZ^{k+1} − V^k + Xs diag(e^k) − Xs) − P2^k/(δµ) ||_F^2

e^{k+1} = argmin_e g_E(e) + (δµ/2) Σ_{j=1}^{ds} ( e_j − e_j^k + (1/(δµ)) Σ_{i=1}^{m} (Xs)_{i,j} (P2^k + δ(YZ^{k+1} − V^k + Xs diag(e^k) − Xs))_{i,j} )^2

P1^{k+1} = P1^k + δ(Z^{k+1} − T^{k+1})

P2^{k+1} = P2^k + δ(YZ^{k+1} − V^{k+1} − Xs + Xs diag(e^{k+1})).

Each of these subproblems can be efficiently solved. There are closed formulas for the Z^{k+1} and V^{k+1} updates, and the e^{k+1} and T^{k+1} updates both involve orthogonal projections that can be efficiently computed.

C. Effect of extended model

A helpful example for visualizing the effect of the extended model (15) is to apply it to an RGB image. Even though the low dimensionality makes this significantly different from hyperspectral data, it's possible to view a scatter plot of the colors and how modifying the model parameters affects the selection of endmembers. The NMR data in Section V-E is four dimensional, so low dimensional data is not inherently unreasonable.

For the following RGB experiments, we use the same parameters as described in Section II-E and use the same k-means with farthest first initialization strategy to reduce the size of the initial matrix Y. We do not however perform the alternating minimization refinement step. Due to the different algorithm used to solve the extended model, there is an additional numerical parameter µ, which for this application must be greater than two according to [34]. We set µ equal to 2.01. There are also model parameters rj and γ for modeling the noise and outliers. To model the small scale noise V, we set rj = η + mj, where η is fixed at .07 and mj is the maximum distance from data in cluster j to the cluster center Yj. To model the sparse error e, we use several different values of γ, which should roughly correspond to the fraction of data we are allowed to ignore as outliers.
We will also use different values of ν, setting ν = 50 to encourage a sparser abundance matrix and setting ν = 0 to remove the weighted l1 penalty from the model. Figure 6 shows the data, which is a color image of multicolored folders, the selected endmembers (black dots) from four different experiments, and the sparse abundance matrix T for one of the experiments.

Note that in the ν = 0 experiment, sparsity of T is not encouraged, so the dense gray cluster in the center of the cone, which corresponds to the floor in the image, is not selected as an endmember. Additionally, the selected endmembers are towards the very outside of the cone of data. In all the other experiments where ν = 50 the floor is selected as an endmember even though it is in the center of the cone of data. This results in a much sparser matrix T. Moreover, the selected endmembers tend to be in the center of the clusters of colors that we see in the scatter plot of the folder data. Finally we note that as we increase the parameter γ, fewer endmembers are selected and some of the smaller outlying color clusters are ignored.

D. Comparison between the base and the outlier model

To illustrate the difference between the outlier and the base model we use the same dataset as for the supervised endmember detection experiment and first repeat the experiment from Section IV-A1 with 30 data points for each endmember, 20 data points for each combination of two different endmembers, 10 data points for each combination of three different endmembers, and additionally 30 data points as mixtures of all endmembers. We add Gaussian noise with zero mean and standard deviation 0.005, run the outlier model with ζ = 1, η = 0.08, γ = 0.01, ν = 40 and run the basic model with ζ = 1.3, ν = 40, β = 250. Figure 7 shows the results for both methods with their average angle of deviation from the true endmembers.

We can see that both models give good results close to the ground truth. Due to the Gaussian noise we added, the best possible choice of columns of X deviates by an average angle of 3.29 degrees from the true endmembers. With the refinement step, both methods could achieve an average angle below this value. The remaining deviation is mainly due to the methods selecting an almost straight line rather than the corresponding true endmember. The similar results show that the extended model can be made to perform like the basic one. The extended model has the advantage, though, of being able to ignore a specified fraction of outliers.

As a second experiment we simulate outliers in the data. First we again create a mixed pixel data set with nine endmembers as above, but without adding Gaussian noise. Next, we create a spike signal and add the spike itself as well as about 3% data that is mixed with the spike signal. This shall simulate a small fraction of the data being outliers. Again, we run the basic as well as the outlier model on this dataset and obtain the results shown in Figure 8. The upper left image shows the true nine endmembers plus the spike signal we used to create the outliers with. As we can see in the image on the upper right, which shows the result of the basic model, the algorithm selected the spike as an endmember. This is very reasonable because although only 3% of the data contains parts of the spike, it is a strong outlier and hence expensive for the fidelity term in the Frobenius norm to exclude. The shown ten detected endmembers deviate only by 3.6 degrees from the nine true endmembers and the spike.

The second row of Figure 8 shows the nine true endmembers on the left and the result of the outlier model on the right. As we can see the outlier model was able to handle the small fraction of signals containing the spike and only selected the nine true endmembers. The average angle of deviation in this case is 2.7 degrees and we can confirm that the model behaves like we expected it to.

E. Application to blind source separation of NMR data

To illustrate how our model can be applied to BSS problems, we use it to recover the four NMR source spectra from ([1] Fig. 4) from four noise free mixtures. The four sources are shown in Figure 9. Let S0 ∈ R^{4×5000} be the sources and let the mixtures X0 be generated by X0 = A0 S0 with

A0 = [ .3162  .6576  .3288  .5000
       .3162  .3288  .6576  .5000
       .6325  .1644  .1644  .5000
       .6325  .6576  .6576  .5000 ].

We will use the outlier model to recover the mixing matrix A from X0. Unlike the hyperspectral examples, some columns of X0 here can be nearly zero if all sources are simultaneously zero at the same spectral index. We can see from Figure 9 that this is indeed the case.

[Fig. 6 panels: original image; ν = 0, γ = 0; ν = 50, γ = .005; T for ν = 50, γ = .005; ν = 50, γ = .01; ν = 50, γ = .1]

Fig. 6: Results of the extended model applied to an RGB image. Upper left: RGB image we apply the blind unmixing algorithm to. Upper middle: 3d plot of the data points in the image in their corresponding color. Shown as black dots are the endmembers detected without allowing outliers (γ = 0) and without encouraging particular sparsity on the coefficients (ν = 0). Upper right: Allowing some outliers, the method removed an endmember in one of the outside clusters, but included the middle cluster due to the encouraged sparsity. Lower left: Endmember coefficients for the parameter choice ν = 50, γ = 0.005, where the brightness corresponds to the coefficient value. We can see that the coefficient matrix is sparse. Lower middle: Increasing the allowed outliers, the red cluster endmember is removed. Increasing the outliers even further leads to decreasing the number of endmembers to 4.

Fig. 7: Comparison between the basic and the outlier method on Indian pines data with Gaussian noise

Since our algorithm uses normalized data, we first remove columns of X0 whose norm is below some threshold, which we take to be .01 max_j ||X_j||. We then normalize the remaining columns to get X. This simple approach suffices for this example, but in general the parameters rj for the V ∈ D constraint could also be modified to account for columns of X0 that have significantly different norms.

A minor difficulty in applying our method to this BSS problem is that we know A should have four columns but there is no way to constrain the algorithm to produce a dictionary with exactly four elements. We therefore adjust parameters until the dimension of the result is correct. This is straightforward to do and could be automated. For example, to choose a smaller dictionary, we can reduce ν and/or increase γ. The parameters used here are identical to those used in the RGB experiments of Section V-C except γ = .01 and ν = 5. Also, for the data reduction step, the angle constraint was increased to ⟨Yi, Yj⟩ < .998. The computed mixing matrix after permutation is

A = [ .3267  .6524  .3327  .4933
      .3180  .3300  .6544  .5110
      .6228  .1757  .1658  .4836
      .6358  .6593  .6585  .5114 ].
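The sources can then be recovered from the computed mixing matrix. For this 4 × 4 invertible A the paper uses S = max(0, A^{-1} X0); the NNLS branch below is a more general fallback we add purely for illustration, and the function name is our own:

```python
import numpy as np
from scipy.optimize import nnls

def recover_sources(A, X0):
    """Recover the source matrix S from the computed mixing matrix A and the
    un-normalized mixtures X0."""
    m, n = A.shape
    if m == n:
        return np.maximum(0.0, np.linalg.solve(A, X0))    # S = max(0, A^{-1} X0)
    S = np.zeros((n, X0.shape[1]))
    for j in range(X0.shape[1]):                          # non-negative least squares
        S[:, j], _ = nnls(A, X0[:, j])
    return S
```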

Fig. 8: Comparison between the basic and the outlier method on Indian pines data with outliers

Note that the columns of A are normalized because the algorithm selects dictionary elements from a normalized version of the data X. Since for this simple problem A is invertible, it is straightforward to recover the sources by S = max(0, A^{−1} X0). More generally, we can recover S by minimizing the convex functional in Equation (7) with respect to S using the un-normalized data matrix X0 and the computed endmembers A.

VI. FUTURE RESEARCH

We have presented a convex method for factoring a data matrix X into a product AS with S ≥ 0 under the constraint that the columns of A appear somewhere in the data X. This type of factorization ensures the physical meaning of the dictionary A, and we have successfully applied it to hyperspectral endmember detection and blind source separation in NMR. For non-repeating noise free data, the l1,∞ regularization was proven to be an exact relaxation of the row-0 norm. We further proposed an extended model that can better handle outliers. Possible future application areas include computational biology, sensor networks, and in general dimensionality reduction and compact representation applications where the physical interpretation of the reduced space is critical. It will also be interesting to try and extend our convex model to the class of problems discussed in [35] for which the pixel purity or non-overlapping assumption is approximately but not exactly satisfied.

Acknowledgments - We would like to acknowledge Laura Balzano and Rob Nowak, who are working on a similar selection framework without non-negativity constraints, partially motivated by problems in sensor networks, and had provided important comments and feedback. We would also like to thank Nicolas Gillis, John Greer and Todd Wittman for helpful discussions about endmember detection strategies, and Yuanchang Sun for his advice on BSS problems.

APPENDIX
PROOF OF LEMMA III.2

Proof: Since T^δ_α is a minimizer of J_α we can conclude

||T^δ_α||_{1,∞} ≤ (1/α) J_α(T^δ_α)
             ≤ (1/α) J_α(T̂)
             = ||T̂||_{1,∞} + (1/α) ||(X + N)T̂ − (X + N)||_F^2
             = ||T̂||_{1,∞} + (1/α) ||(XT̂ − X) + (NT̂ − N)||_F^2      (with XT̂ − X = 0)
             ≤ ||T̂||_{1,∞} + (1/α) ||N||_F^2 ||T̂ − Id||_F^2
             = ||T̂||_{1,∞} + (δ²/α) ||T̂ − Id||_F^2.   (16)

Since δ²/α → 0, T^δ_α is bounded in the ||·||_{1,∞} norm and therefore has a convergent subsequence. Let T^{δ_n}_{α_n} denote such a convergent subsequence and let T̄ be its limit. Because T^{δ_n}_{α_n} ≥ 0 we also have T̄ ≥ 0. We can now use the above estimate for showing

||T̄||_{1,∞} = lim_n ||T^{δ_n}_{α_n}||_{1,∞}
           ≤ lim_n ( ||T̂||_{1,∞} + (δ_n²/α_n) ||T̂ − Id||_F^2 )
           = ||T̂||_{1,∞}.   (17)

Fig. 9: Four NMR source spectra from [1] Fig. 4

Furthermore, we have [8] L. Miao and H. Qi, “Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization,” ¯ 2 δn δn δn 2 kXT − XkF = lim kX Tα − X kF IEEE Trans. Geosci. Rem. Sens., vol. 45, no. 3, pp. 765–777, 2007. n n [9] M. Berry, M. Browne, A. Langville, P. Pauca, and R.J. Plemmons, δn “Algorithms and applications for approximate nonnegative matrix fac- ≤ lim Jαn (Tα ) n n torization,” Computational Statistics and Data Analysis, vol. 52, pp. 155–173, 2007. ≤ lim Jα (Tˆ) n n [10] V. P. Pauca, J. Piper, and R. J. Plemmons, “Nonnegative matrix factor- δn ˆ δn 2 ˆ ization for spectral data analysis,” Linear Algebra and its Applications, ≤ lim kX T − X kF + αnkT k1,∞ n vol. 416, no. 1, pp. 29–47, 2006. 2 ˆ 2 ˆ [11] A. Zare, “Hyperspectral endmember detection ≤ lim δnkT − IdkF + αnkT k1,∞ n and band selection using Bayesian methods,” 2008, http://gradworks.umi.com/33/47/3347194.html. = 0. (18) [12] S. Moussaoui, D. Brie, A. Mohammad-Djafari, and C. Carteret, “Sepa- ¯ ration of non-negative mixture of non-negative sources using a bayesian Now, by the above estimate (18) we know that XT = X. approach and mcmc sampling,” Signal Processing, IEEE Transactions Furthermore, by the estimate (17) and taking into account that on, vol. 54, no. 11, pp. 4133 –4145, nov. 2006. [13] N. Dobigeon, S. Moussaoui, M. Coulon, J.-Y. Tourneret, and A.O. Tˆ was a non-negative k · k1,∞-minimum norm solution of Hero, “Joint bayesian endmember extraction and linear unmixing for XT = X, we have shown that the limit of our convergent hyperspectral imagery,” Signal Processing, IEEE Transactions on, vol. subsequence is also a non-negative k · k1,∞-minimum norm 57, no. 11, pp. 4355 –4368, nov. 2009. solution of XT = X. [14] J.M.P. Nascimento and J.M. Bioucas-Dias, “Blind hyperspectral un- mixing,” in Proceedings of the SPIE Conference on Image and Signal Processing for Remote Sensing XIII, 2007, vol. 6748. REFERENCES [15] C. Boutsidis, M. W. Mahoney, and P. Drineas, “An improved approx- imation algorithm for the column subset selection problem,” in SODA [1] W. Naanaa and J-M. Nuzillard, “Blind source separation of positive and ’09: Proceedings of the twentieth Annual ACM-SIAM Symposium on partially correlated data,” Signal Processing, vol. 85, pp. 1711–1722, Discrete Algorithms, 2009, pp. 968–977. 2005. [16] T.F. Chan and P.C. Hansen, “Some applications of the rank revealing [2] M. E. Winter, “N-FINDR: an algorithm for fast autonomous spectral QR factorization,” SIAM J. Sci. and Stat. Comput., vol. 13, no. 3, pp. end-member determination in hyperspectral data,” in Imaging Spectrom- 727–741, 1992. etry V. 1999, vol. 3753, pp. 266–275, SPIE. [17] A. Krause and V. Cevher, “Submodular dictionary selection for sparse [3] R. Jenatton, J.-Y. Audibert, and F. Bach, “Structured variable selection representation,” in ICML, 2010. with sparsity-inducing norms,” Tech. Rep., 2009, arXiv:0904.3523v2. [18] A. Das and D. Kempe, “Submodular meets spectral: Greedy algorithms [4] J. A. Tropp, “Algorithms for simultaneous sparse approximation: part for subset selection, sparse approximation and dictionary selection,” II: Convex relaxation,” Signal Process., vol. 86, no. 3, pp. 589–602, Tech. Rep., 2011, arXiv:1102.3975. 2006. [19] L. Balzano, R. Nowak, and W.U. Bajwa, “Column subset selection [5] G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low-rank with missing data,” in Proc. NIPS Workshop on Low-rank Methods for representation,” in ICML, 2010, pp. 663–670. 
REFERENCES

[1] W. Naanaa and J.-M. Nuzillard, “Blind source separation of positive and partially correlated data,” Signal Processing, vol. 85, pp. 1711–1722, 2005.
[2] M. E. Winter, “N-FINDR: An algorithm for fast autonomous spectral end-member determination in hyperspectral data,” in Imaging Spectrometry V, 1999, vol. 3753, pp. 266–275, SPIE.
[3] R. Jenatton, J.-Y. Audibert, and F. Bach, “Structured variable selection with sparsity-inducing norms,” Tech. Rep., 2009, arXiv:0904.3523v2.
[4] J. A. Tropp, “Algorithms for simultaneous sparse approximation. Part II: Convex relaxation,” Signal Process., vol. 86, no. 3, pp. 589–602, 2006.
[5] G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low-rank representation,” in ICML, 2010, pp. 663–670.
[6] W. Xu, X. Liu, and Y. Gong, “Document clustering based on non-negative matrix factorization,” in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '03, 2003, pp. 267–273, ACM.
[7] A. Holzapfel and Y. Stylianou, “Musical genre classification using nonnegative matrix factorization-based features,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 16, pp. 424–434, 2008.
[8] L. Miao and H. Qi, “Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization,” IEEE Trans. Geosci. Rem. Sens., vol. 45, no. 3, pp. 765–777, 2007.
[9] M. Berry, M. Browne, A. Langville, P. Pauca, and R. J. Plemmons, “Algorithms and applications for approximate nonnegative matrix factorization,” Computational Statistics and Data Analysis, vol. 52, pp. 155–173, 2007.
[10] V. P. Pauca, J. Piper, and R. J. Plemmons, “Nonnegative matrix factorization for spectral data analysis,” Linear Algebra and its Applications, vol. 416, no. 1, pp. 29–47, 2006.
[11] A. Zare, “Hyperspectral endmember detection and band selection using Bayesian methods,” 2008, http://gradworks.umi.com/33/47/3347194.html.
[12] S. Moussaoui, D. Brie, A. Mohammad-Djafari, and C. Carteret, “Separation of non-negative mixture of non-negative sources using a Bayesian approach and MCMC sampling,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4133–4145, Nov. 2006.
[13] N. Dobigeon, S. Moussaoui, M. Coulon, J.-Y. Tourneret, and A. O. Hero, “Joint Bayesian endmember extraction and linear unmixing for hyperspectral imagery,” IEEE Transactions on Signal Processing, vol. 57, no. 11, pp. 4355–4368, Nov. 2009.
[14] J. M. P. Nascimento and J. M. Bioucas-Dias, “Blind hyperspectral unmixing,” in Proceedings of the SPIE Conference on Image and Signal Processing for Remote Sensing XIII, 2007, vol. 6748.
[15] C. Boutsidis, M. W. Mahoney, and P. Drineas, “An improved approximation algorithm for the column subset selection problem,” in SODA '09: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2009, pp. 968–977.
[16] T. F. Chan and P. C. Hansen, “Some applications of the rank revealing QR factorization,” SIAM J. Sci. and Stat. Comput., vol. 13, no. 3, pp. 727–741, 1992.
[17] A. Krause and V. Cevher, “Submodular dictionary selection for sparse representation,” in ICML, 2010.
[18] A. Das and D. Kempe, “Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection,” Tech. Rep., 2011, arXiv:1102.3975.
[19] L. Balzano, R. Nowak, and W. U. Bajwa, “Column subset selection with missing data,” in Proc. NIPS Workshop on Low-rank Methods for Large-scale Machine Learning, 2010.
[20] A. Szlam, Z. Guo, and S. Osher, “A split Bregman method for non-negative sparsity penalized least squares with applications to hyperspectral demixing,” Tech. Rep., 2010, UCLA CAM Report 10-06.
[21] Z. Guo, T. Wittman, and S. Osher, “L1 unmixing and its application to hyperspectral image enhancement,” in Proceedings SPIE Conference on Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XV, vol. 7334, pp. 73341M–73341M–9, 2008.
[22] J. Greer, “Sparse demixing,” in SPIE Proceedings on Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XVI, vol. 7695, pp. 76951O–76951O–12, 2010.
[23] A. Castrodad, Z. Xing, J. Greer, E. Bosch, L. Carin, and G. Sapiro, “Learning discriminative sparse models for source separation and mapping of hyperspectral imagery,” Submitted September 2010, http://www.ima.umn.edu/preprints/oct2010/oct2010.html.
[24] P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar, “C-HiLasso: A collaborative hierarchical sparse modeling framework,” June 2010, http://arxiv.org/abs/1006.1346.
[25] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach, “Network flow algorithms for structured sparsity,” CoRR, vol. abs/1008.5209, 2010.
[26] D. Gabay and B. Mercier, “A dual algorithm for the solution of nonlinear variational problems via finite-element approximations,” Comp. Math. Appl., vol. 2, pp. 17–40, 1976.
[27] R. Glowinski and A. Marrocco, “Sur l'approximation par éléments finis d'ordre un, et la résolution par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires,” Rev. Française d'Aut. Inf. Rech. Opér., vol. R-2, pp. 41–76, 1975.
[28] X. Zhang, M. Burger, and S. Osher, “A unified primal-dual algorithm framework based on Bregman iteration,” Tech. Rep., 2009, UCLA CAM Report 09-99.
[29] J. J. Moreau, “Proximité et dualité dans un espace hilbertien,” Bull. Soc. Math. France, vol. 93, pp. 273–299, 1965.
[30] E. Esser, “Applications of Lagrangian-based alternating direction methods and connections to split Bregman,” Tech. Rep., 2009, UCLA CAM Report 09-31.
[31] J. M. P. Nascimento and J. M. Bioucas-Dias, “Vertex component analysis: A fast algorithm to unmix hyperspectral data,” IEEE Trans. Geosci. Rem. Sens., vol. 43, pp. 898–910, 2004.
[32] “Open source MATLAB hyperspectral toolbox,” Version 0.04, 2010, http://matlabhyperspec.sourceforge.net/.
[33] E. Candes, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis,” 2009, http://arxiv.org/abs/0912.3599.
[34] B. He, M. Tao, and X. Yuan, “A splitting method for separate convex programming with linking linear constraints,” Tech. Rep., 2010.
[35] Y. Sun, C. Ridge, F. del Rio, A. J. Shaka, and J. Xin, “Postprocessing and sparse blind source separation of positive and partially overlapped data,” Tech. Rep., 2010.

Ernie Esser received B.S. degrees in mathematics and in applied and computational mathematical sciences from the University of Washington in 2003, and a Ph.D. in mathematics from UCLA in 2010. He is currently a postdoc at UC Irvine. His main research interests are convex optimization, image processing, and inverse problems.

Michael Möller received his diploma in mathematics from the Westfälische Wilhelms-Universität Münster in 2009, where he is currently a Ph.D. student. He spent two years as a visiting researcher at the University of California, Los Angeles. His research interests include convex optimization, compressed sensing, image processing, and inverse problems.

Stanley Osher received his B.S. degree in mathematics from Brooklyn College, Brooklyn, NY, in 1962, and M.S. and Ph.D. degrees in mathematics from New York University in 1964 and 1966, respectively. He has been with the University of California, Los Angeles, since 1977. He is currently a Professor in the Department of Mathematics and the Director of Special Projects at the Institute for Pure and Applied Mathematics. He is a member of the National Academy of Sciences and the American Academy of Arts and Sciences, and is one of the top 25 most highly cited researchers in both mathematics and computer science. He has received numerous academic honors and has co-founded three successful companies, each based largely on his own (joint) research. His current interests mainly involve information science, which includes image processing, compressed sensing, and machine learning. His website is www.math.ucla.edu/~sjo. His recent papers are on the website for the UCLA Computational and Applied Mathematics Reports at http://www.math.ucla.edu/applied/cam/index.shtml.

Guillermo Sapiro was born in Montevideo, Uruguay, on April 3, 1966. He received his B.Sc. (summa cum laude), M.Sc., and Ph.D. from the Department of Electrical Engineering at the Technion, Israel Institute of Technology, in 1989, 1991, and 1993, respectively. After post-doctoral research at MIT, Dr. Sapiro became Member of Technical Staff at the research facilities of HP Labs in Palo Alto, California. He is currently with the Department of Electrical and Computer Engineering at the University of Minnesota, where he holds the position of Distinguished McKnight University Professor and Vincentine Hermes-Luh Chair in Electrical and Computer Engineering.
G. Sapiro works on differential geometry and geometric partial differential equations, both in theory and applications in computer vision, computer graphics, medical imaging, and image analysis. He recently co-edited a special issue of IEEE Image Processing on this topic and a second one in the Journal of Visual Communication and Image Representation. He has authored and co-authored numerous papers in this area and has written a book published by Cambridge University Press, January 2001.
G. Sapiro was awarded the Gutwirth Scholarship for Special Excellence in Graduate Studies in 1991, the Ollendorff Fellowship for Excellence in Vision and Image Understanding Work in 1992, the Rothschild Fellowship for Post-Doctoral Studies in 1993, the Office of Naval Research Young Investigator Award in 1998, the Presidential Early Career Award for Scientists and Engineers (PECASE) in 1998, the National Science Foundation Career Award in 1999, and the National Security Science and Engineering Faculty Fellowship in 2010.
G. Sapiro is a member of IEEE and SIAM. He is the founding Editor-in-Chief of the SIAM Journal on Imaging Sciences.

Jack Xin received his B.S. in computational mathematics from Peking University in 1985, and M.S. and Ph.D. degrees in applied mathematics in 1988 and 1990. He was a postdoctoral fellow at Berkeley and Princeton in 1991 and 1992. He was assistant and associate professor of mathematics at the University of Arizona from 1991 to 1999, and a professor of mathematics at the University of Texas at Austin from 1999 to 2005. He has been a professor of mathematics in the Department of Mathematics, Center for Hearing Research, Institute for Mathematical Behavioral Sciences, and Center for Mathematical and Computational Biology at UC Irvine since 2005. He is a fellow of the John S. Guggenheim Foundation. His research interests include applied analysis, computational methods and their applications in nonlinear and multiscale dynamics in fluids, and signal processing.