<<

6254 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 64, NO. 23, DECEMBER 1, 2016 Robust Volume Minimization-Based Factorization for Remote Sensing and Document Clustering Xiao Fu, Member, IEEE, Kejun Huang, Student Member, IEEE,BoYang, Student Member, IEEE, Wing-Kin Ma, Senior Member, IEEE, and Nicholas D. Sidiropoulos, Fellow, IEEE

Abstract—This paper considers volume minimization (VolMin)- attention, since they are capable of not only reducing dimension- based structured matrix factorization. VolMin is a factorization ality of the collected data, but also retrieving loading factors that criterion that decomposes a given data matrix into a matrix have physically meaningful interpretations. times a structured coefficient matrix via finding the minimum- In to NMF, some related SMF models have attracted volume simplex that encloses all the columns of the data matrix. Recent work showed that VolMin guarantees the identifiability of considerable interest in recent years. The remote sensing com- the factor matrices under mild conditions that are realistic in a wide munity has spent much effort on a class of factorizations where variety of applications. This paper focuses on both theoretical and the columns of one factor matrix are constrained to lie in the practical aspects of VolMin. On the theory side, exact equivalence unit simplex [3]. The same SMF model has also been utilized for of two independently developed sufficient conditions for VolMin document clustering [4], and, most recently, multi-sensor array identifiability is proven here, thereby providing a more compre- processing and blind separation of power spectra for dynamic hensive understanding of this aspect of VolMin. On the algorithm side, computational complexity and sensitivity to outliers are two spectrum access [5], [6]. key challenges associated with real-world applications of VolMin. The first key question concerning SMF lies in identifiability– These are addressed here via a new VolMin algorithm that han- when does a factorization model or criterion admit unique so- dles volume regularization in a computationally simple way, and lution in terms of its factors? Identifiability is important in automatically detects and iteratively downweights outliers, simul- applications such as parameter estimation, feature extraction, taneously. Simulations and real-data experiments using a remotely and signal separation. In recent years, identifiability conditions sensed hyperspectral image and the Reuters document corpus are have been investigated for the NMF model [7]Ð[9]. An unde- employed to showcase the effectiveness of the proposed algorithm. sirable property of NMF highlighted in [9] is that identifiability Index Terms—Document clustering, hyperspectral unmixing, hinges on both loading factors containing a certain number of identifiability, matrix factorization, robustness against outliers, zeros. In many applications, however, there is at least one factor simplex-volume minimization (VolMin). that is dense. In hyperspectral unmixing (HU), for example, the basis factor (i.e., the spectral signature matrix) is always dense. I. INTRODUCTION On the other hand, very recent work [5], [10] showed that the TRUCTURED MATRIX FACTORIZATION (SMF) has SMF model with the coefficient matrix columns lying in the S been a popular tool in signal processing and machine learn- unit simplex admits much more relaxed identifiability condi- ing. For decades, factorization models such as the singular value tions. Specifically, Fu et al. [5] and Lin et al. [10] proved that, decomposition (SVD) and eigen-decomposition have been ap- under some realistic conditions, unique loading factors (up to plied for dimensionality reduction (DR), subspace estimation, column permutations) can be obtained by finding a minimum- noise suppression, feature extraction, etc. 
Motivated by the in- volume enclosing simplex of the data vectors. Notably, these fluential paper of Lee and Seung [2], new SMF models such identifiability conditions of the so-called volume minimization as nonnegative matrix factorization (NMF) have drawn much (VolMin) criterion allow working with dense basis matrix fac- tors; in fact, the model does not impose any constraints on the Manuscript received March 3, 2016; revised June 8, 2016 and July 28, 2016; basis matrix except for having full-column rank. Since the NMF accepted August 6, 2016. Date of publication August 25, 2016; date of current model can be recast as (viewed as a special case of) the above version October 6, 2016. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Wee Peng Tay. This SMF model [11], such results suggest that VolMin is an attrac- work was supported in part by the National Scientific Foundation (NSF) under tive alternative to NMF for the wide range of applications of Project NSF-ECCS 1608961 and in part by the Hong Kong Research Grants NMF and beyond. Council under Project CUHK 14205414. A part of this work was presented in Compared to NMF, VolMin-based matrix factorization is the Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 2016 [1]. computationally more challenging. The notable prior works X. Fu, K. Huang, B. Yang, and N. D. Sidiropoulos are with the De- in [12] and [13] formulated VolMin as a constrained (log- partment of Electrical and Computer Engineering, University of Minnesota, )determinant minimization problem, and applied successive Minneapolis, MN 55455, USA (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). and alternating optimization to deal with W.-K. Ma is with the Department of Electronic Engineering, The it, respectively. The major drawback of these pioneering works Chinese University of Hong Kong, Shatin, N.T., Hong Kong (e-mail: is that the algorithms were developed under a noiseless setting, [email protected]). and thus only work well for high signal-to-noise ratio (SNR) Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. cases. Also, these algorithms work in the dimension-reduced Digital Object Identifier 10.1109/TSP.2016.2602800 domain, but the DR process may be sensitive to outliers and

1053-587X © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. FU et al.: ROBUST VOLUME MINIMIZATION-BASED MATRIX FACTORIZATION FOR REMOTE SENSING AND DOCUMENT CLUSTERING 6255 modeling errors. The work [14] took noise into consideration, A conference version of part of this work appears in [1]. but the algorithm is computationally prohibitive and has no Beyond [1], this journal version includes the equivalence guarantee of convergence. Some other algorithms [15], [16] of the identifiability conditions, first-order optimization-based work in the original data domain, and deal with a volume- updates, consideration of different types of regularization and regularized data fitting problem. Such a formulation can tol- constraints, proof of convergence, extensive simulations, and erate noise to a certain level, but is harder to tackle than experiments using real data. those in [12], [13], [16]Ðvolume regularizers typically intro- Notation: We largely follow common notational conventions duce extra difficulty to an already very hard bilinear fitting in signal processing. x ∈ Rn and X ∈ Rm ×n denote a real- problem. valued n-dimensional vector and a real-valued m × n matrix, The second major challenge of implementing VolMin is that respectively (resp.). x ≥ 0 (resp. X ≥ 0) means that x (resp. ∈ n ∈ m ×n the VolMin criterion is very sensitive to outliers: it has been X) is element-wise non-negative. x R+ (resp. X R+ ) noted in the literature that even a single outlier can make the also means that x (resp. X) is element-wise non-negative. X  VolMin criterion fail [3]. However, in real-world applications, 0 and X  0 mean that X is positive definite and positive outlying measurements are commonly seen: in HU, pixels that semidefinite, resp. The superscripts “T ” and “−1” stand for the do not always obey the nominal model are frequently spotted transpose and inverse operations, resp. The p of a vector ∈ n ≥   n | |p 1/p because of the complicated physical environment [17]; and in x R , p 1, is denoted by x p =( i=1 xi ) .Thep document clustering, articles that are difficult to be classified to quasi-norm, 0

where vol(B) denotes a measure that is related or proportional to the volume of the simplex conv{b1 ,...,bK }, and (3b)Ð(3c) mean that every x[] is enclosed in conv{b1 ,...,bK } (i.e., x[] ∈ conv{b1 ,...,bK }). In the literature, various functions for vol(B) have been used [5], [12]Ð[16], [18]. One representa- tive choice of vol(B) is T vol(B)=det(B¯ B¯ ), B¯ =[b1 − bK ,...,bK −1 − bK ], (4) or its variants; see [13]Ð[15]. The reason of employing such a function is that det(B¯ T B¯ )/((N − 1)!) is the volume of the Fig. 1. Motivating examples: Hyperspectral unmixing and document { } clustering. simplex conv b1 ,...,bK by definition [21]. Another popular choice of vol(B) is vol(B)=det(BT B); (5)  see [5], [12], [16], [18]. Note that det(BT B)/(N!) is the

volume of the simplex conv{0, b1 ,...,bK }, which should scale similarly with the volume of conv{b1 ,...,bK }. The upshot of (5) is that (5) has a simpler structure than (4).

B. Identifiability of VolMin The most appealing aspect of VolMin is its identifiability of A and S: Under mild and realistic conditions, the optimal solution to Problem (3) is essentially the true (A, S). To be precise, let us make the following definition. Definition 1: (VolMin Identifiability) Consider the matrix Fig. 2. The intuition of VolMin. factorization model in (1), (2), and let (B , C ) be any opti- content, or social network posts), and cluster the data according mal solution to Problem (3). If every optimal (B , C ) satisfies to their weights on different topics/opinions. In hyperspectral B = AΠ and C = ΠT S, where Π denotes a permutation remote sensing [3], [20], x[] represents a remotely sensed pixel matrix, then we say that VolMin identifies the true matrix fac- using sensors of high spectral resolution, a1 ,...,aK denote K tors, or VolMin identifiability holds. different spectral signatures of materials that comprise the pixel Fu et al. [5] have shown that. x[], and sk [] denotes the proportion of material k contained in pixel x[]. Estimating A enables recognition of the underlying Theorem 1: Let vol(B) be the function in (5). Define a second order cone C = {x ∈ RN |1T x ≥ √ 1 x }. Then materials in a hyperspectral image. See Fig. 1 for an illustration N −1 2 of these motivating examples. Very recently, the same model VolMin identifiability holds if rank(A)=rank(S)=K and has been applied to power spectra separation [6] for dynamic i) C⊆cone(S); and spectrum access and fast blind speech separation [5]. In addi- ii) cone(S) ⊆ cone(Q) where Q is any unitary matrix ex- tion, many applications of NMF can also be considered under cept the permutation matrices. the model in (1) and (2), after suitable normalization [11]. Many algorithms have been developed for finding such a In plain words, a sufficient condition under which VolMin { }L factorization, and we refer the readers to [3], [4] for a survey. identifiability holds is when s[] =1 are sufficiently scattered C Among these algorithms, we are particularly interested in the over the unit simplex, such that the second order cone is a so-called VolMin criterion, which is identifiable under certain subset of cone(S). By comparing Theorem 1 to the identifia- reasonable conditions. VolMin is motivated by the nice geo- bility conditions for NMF (see [7]Ð[9]), we see a remarkable metrical interpretation of the constraints in (2): Under these advantageÐVolMin does not have any restriction on A except constraints, all the data points live in a convex hull spanned by being full-column rank. In fact, A can be dense, partially neg- ative, or even complex-valued. This result allows us to apply a1 ,...,aK (or, a simplex spanned by a1 ,...,aK when the ak ’s are linearly independent); see Fig. 2. If the data points are suffi- VolMin to a wider variety of applications than NMF. In [10], another sufficient condition for VolMin identifiability ciently spread in conv{a ,...,aK }, then the minimum-volume 1 was proposed. enclosing convex hull coincides with conv{a1 ,...,aK }.For- mally, the VolMin criterion can be formulated as Theorem 2: Let vol(B) be the function in (4). Assume rank(A)=rank(S)=K. Define R(r)={s ∈ RN |{s ≤ (A, {s[]}) = arg min vol(B) (3a) 2 B,{c[]} r}∩conv{e1 ,...,eN }} and γ = sup{r|R(r)}⊆conv(S)}. Then VolMin identifiability holds if γ>√ 1 . s.t. x[]=Bc[], (3b) N −1 Theorem 2 does not characterize its identifiability condition T ≥ ∀ 1 c[]=1, c[] 0, , (3c) using convex cones like Theorem 1 did. 
Instead, it defines a FU et al.: ROBUST VOLUME MINIMIZATION-BASED MATRIX FACTORIZATION FOR REMOTE SENSING AND DOCUMENT CLUSTERING 6257

Lemma 3), our result here also helps better understand the suffi- cient condition for NMF identifibility in [9] in a more intuitively pleasing way.

III. ROBUST VOLMIN VIA INEXACT BCD In this section, we turn our attention to designing algorithms for dealing with the VolMin criterion. Optimizing the VolMin criterion is challenging. In early works such as [12], [18], linear DR with X is assumed such that the basis after DR is a square matrix. This subsequently enables one to write the DR-domain VolMin problem as

Fig. 3. Visualization of the sufficient conditions on the hyperplane 1T x =1. min log | det(B˜ )| The sufficient conditions in [5] and [10] both require that C (the inner circle) is B˜ ∈RK ×K ,C∈RK ×L contained in cone(S). s.t. x˜[]=Bc˜ [] ‘diameter’ r of the convex hull spanned by the columns of S, 1T c[]=1, c[] ≥ 0, (6) and then develops an identifiability condition based on it. The sufficient conditions presented in the two theorems seem- where x˜[] ∈ RK is the dimension-reduced data vector corre- ingly have different flavors, but we notice that they are re- sponding to x[], and B˜ ∈ RK ×K is a dimension-reduced basis. lated in essence. To see the connections, we first note that Note that minimizing log | det(B˜ )| is the same as minimizing Theorem 1 still holds after replacing cone(S) and C with T det(B˜ B˜ ). Problem (6) can be efficiently tackled via either convex hulls conv{s[1],...,s[L]} and C∩conv{e ,...,e }, 1 N alternating optimization [13] or successive convex optimization respectivelyÐsince the s[]’s are all in conv{e ,...,e }.In 1 N [12], [18]. The drawback with these existing algorithms is that fact, C∩conv{e ,...,e } is exactly the set R(r) for r ≤ √ 1 N noise was not taken into consideration. Also, these approaches 1/ N − 1. Geometrically, we illustrate the conditions in Fig. 3 require DR to make the effective A squareÐbut DR may not be using N =3 for visualization. We see that, if we look at reliable in the presence of outliers or modeling errors. Another the conditions in Theorem 1 at the 2-dimensional hyperplane major class of algorithms such as those in [15], [16] considers that contains 1T x =1, the two conditions both mean that { } the inner shaded region is contained in conv s[1],...,s[L] .  − 2 · min X BC F + λ vol(B) Motivated by this observation, in this paper, we rigorously B∈RM ×K ,C∈RK ×L show that s.t. C ≥ 0, 1T C = 1T , (7) Theorem 3: The sufficient conditions for VolMin identifia- bility in Theorem 1 and Theorem 2 are equivalent. where λ>0 is a parameter that balances data fidelity versus The proof of Theorem 3 can be found in Appendix A. Al- VolMin. The formulation in (7) avoids DR and takes noise into though the geometrical connection may seem clear on hindsight, consideration. However, our experience is that volume regular- T rigorous proof is highly nontrivial. We first show that the con- izers, such as vol(B) = det(B B), are numerically harder to dition in Theorem 1 is equivalent to another condition, and then cope with, which will be explained in detail later. We should establish equivalence between the ‘intermediate’ condition and also compare Problem (7) with the VolMin formulation in (3). the condition in Theorem 2. Problem (3) enforces a hard constraint X = BC, and thus en- sures that every feasible B and C have full rank in the noiseless Remark 1: The equivalence between the sufficient condi- case. On the other hand, Problem (7) employs a fitting-based tions in Theorem 1 and Theorem 2 is interesting and surprisingÐ criterion, and an overly large λ could result in rank-deficient although the corresponding theoretical developments started factors even in the noiseless case. Hence, λ should be chosen from very different points of view, they converged to equiva- with caution. lent conditions. Their equivalence brings us deeper understand- Another notable difficulty is that outliers are very damag- ing of the VolMin criterion. The proof itself clarifies the role ing to the VolMin criterion. 
In many cases, a single outlier can of regularity condition ii) in Theorem 1, which was originally make the minimum-volume enclosing convex hull very different difficult to describe geometricallyÐand now we understand that from the desired one; see Fig. 4 for an illustration. The state-of- condition ii) is there to ensure γ>√ 1 , i.e., the existence of a N −1 the-art algorithm that considers outliers for the VolMin-based convex cone that is ‘sandwiched’ by cone(S) and C. In addition, factorization is simplex identification via split augmented La- the equivalence also suggests that the different cost functions grangian (SISAL) [18], but it takes care of outliers in the in (4) and (5) ensure identifiability of A and S under the same dimension-reduced domain. As already mentioned, the DR pro- sufficient conditions, and thus they are expected to perform sim- cess itself may be impaired by outliers, and thus dealing with ilarly in practice. On the other hand, since the function in (5) outliers in the original data domain is more appealing. Directly is easier to handle, using it in practice is more appealing. As a factoring X in the original data domain also has the advantage by-product, since we have proved that condition ii) is equivalent of allowing us to incorporate any aprioriinformation on A and to a condition that was used for NMF identifiability in [9] (cf. S, such as nonnegativity, smoothness, and sparsity. 6258 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 64, NO. 23, DECEMBER 1, 2016

see [27]. Notice that (9) is a coarse approximation of the volume of conv{b1 ,...,bK }, which measures the volume by simply adding up the squared distances between the vertices. B. Update of C Our idea is to update B and C alternately, i.e., using block coordinate descent (BCD). Unlike classic BCD [28], we solve the partial optimization problems in an inexact fashion for effi- ciency. We first consider updating C. The problem w.r.t. C is Fig. 4. The impact of outliers to VolMin. The dots are x[]’s; the shaded area is conv{a ,...,a }, the triangles with dashed lines are data-enclosing convex separable w.r.t. and convex. Therefore, after t iterations with 1 N t t hulls, and the one with solid lines is the minimum-volume enclosing convex the current solution (B , C ), we consider: hull. Left: the case where no outliers exist. Right: the case where a single outlier   exists. 1 2 ct+1[] := arg min x[] − Bt c[] c[] 2 2 A. Proposed Robust VolMin Algorithm s.t. 1T c[]=1, c[] ≥ 0, (10) We are interested in the VolMin-regularized matrix factor- ization, but we take the outlier problem into consideration. for =1,...,L. Since Problem (10) is convex, one can update Specifically, we propose to employ the following optimization C by solving Problem (10) to optimality. An alternating direc- surrogate of the VolMin criterion: tion method of multipliers (ADMM)-based algorithm was pro- vided in the conference version of this work for this purpose; see   L p the detailed implementation in [1]. Nevertheless, exactly solv- 1  − 2 2 λ T min x[] Bc[] 2 +  + log det(B B + τI) B,C 2 2 ing Problem (10) at each iteration is computationally costly, =1 especially when the problem size is large. Here, we propose to s.t. 1T c[]=1, c[] ≥ 0, ∀, (8) deal with Problem (10) using local approximation. Specifically, let where p ∈ (0, 2], λ>0, >0, and τ>0. Here, >0 is a small t 1 − t 2 regularization parameter, which keeps the first term inside its f(c[]; B )= x[] B c[] 2 . smooth region for computational convenience when p<1;if 2 p ∈ (1, 2], we can simply let  =0. The parameter τ>0 is also Then, f(c[]; Bt ) can be locally approximated at ct [] by the a small positive number, which is used to ensure that the cost following: function is bounded from below for any B.  T The motivation of using log det(BT B + τI) instead of the u(c[]; Bt )=f(ct []; Bt )+ ∇f(ct []; Bt ) (c[] − ct []) T commonly used volume regularizers such as det(B B) is com- t putational simplicity: Although both functions are non-convex L  − t 2 + c[] c [] 2 , and conceptually equally hard to deal with, the former features 2 a much simpler update rule because it admits a tight upper where Lt ≥ 0. On the right hand side (RHS) of the above, bound while the latter does notÐthis point will become clearer the first two terms constitute a first-order approximation of T shortly. Interestingly, log det(B B + τI) has been used in f(c[]; Bt ) at ct [], and the second term restrains ct+1[] to be the context of low-rank matrix recovery [22], [23], but here close to ct [] in terms of Euclidean distance. It is well-known we instead apply it for simplex-VolMin. The / -(quasi-) t t T t 2 p that when L ≥(B ) B 2 , norm data fitting part is employed to downweight the impact of the outliersÐwhen 0

Problem (12) is a simple projection that can be solved with the irrelevant terms, we find Bt+1 by solving the following: worst-case complexity of O(K log K) flops; see [30] for a de- L   tailed implementation. w 2 Bt+1 := arg min x[] − Bct+1[] The described update of C has light per-iteration complexity, B 2 2 but it could result in slow convergence of the overall alternat- =1 λ ing optimization algorithm; see Fig. 6 in the simulations. To + Tr(F t (BT B)). (16) improve the convergence speed in practice, and inspired by the 2 success of Nesterov’s optimal first-order algorithm and its re- Problem (16) is a convex quadratic program that admits the lated algorithms [31], [32], we propose the following update following closed-form solution: of C:  − t+1 t t+1 t t+1 T t+1 t+1 T t 1 c []=PL t (y []) (13a) B := XW (C ) C W (C ) + λF , (17) t 2 t+1 1+ 1+4(q ) q = (13b) t t t 2 where W = Diag(w1 ,...,wL ).

qt − 1  Remark 2: The expression in (16) reveals why the proposed yt []=ct []+ ct [] − ct−1 [] , (13c) qt+1 criterion and algorithm can automatically downweight the effect brought by the outliers. Suppose that (Bt , Ct+1) is a “good ∞ { t } 1 t where q t=1 is a sequence with q =1. Simply speaking, in- enough” solution which is close to the ground truth. Then, w is t stead of locally approximating f(c[]; B ) at ct [], we approx- small when x[] is an outlier since the fitting error term x[] − imate it at an ‘extrapolated point’ yt []. Without the alternating t t+1 2 t+1 B c [] 2 is large. Hence, for the next iteration, B is optimization procedure, using extrapolation is provably much estimated with the importance of the outlier x[] downweighted. faster than using the plain gradient-based methods [31], [32]. Embedding extrapolation into alternating optimization was first Remark 3: In practice, adding constraints on B by letting ∈B considered in [33] in the context of tensor factorization, where B is sometimes instrumental, since a lot of applications do acceleration of convergence was observed. In our case, the ex- have prior information that can be used to enhance performance. trapolation procedure also substantially reduces the number of For example, in image processing, a nonnegative B is often B M ×N B iterations for achieving convergence, as will be shown in the sought, and thus one can set = R+ . When is convex, simulations. the problem in (16) can usually be solved in an efficient man- ner; e.g., one can call general-purpose solvers such as interior- C. Update of B point methods. However, using general-purpose solvers here The update of B relies on the following two lemmas. may lose efficiency since solving constrained least squares to a certain accuracy per se may require a lot of iterations. To sim- Lemma 1: [34] Assume 0 0, and let plify the update, we update Bt following the same spirit of up- − p  2 p 2 − 2 p/2 L φp (w):= ( w) p 2 + w. Then, we have (x + ) = t+1 w  − t+1 2 2 p dating C:Letg(B; C )= =1 2 x[] Bc [] 2 + 2  minw ≥0 wx + φp (w). Also, the minimizer is unique and given λ t T L t Tr(F (B B)) + const, where const = φp (w ) − K. p p −2 2 =1 2 2 t+1 by wopt = 2 (x + ) . We solve a local approximation of g(B; C ): × Lemma 2: [35] Let E ∈ RK K be any matrix such that Bt+1 := arg min g(Bt ; Ct+1)+∇g(Bt ; Ct+1)T (B − Bt ) E  0. Consider the function f(F )=Tr(FE) − log det F − B∈B t K. Then, log det E = minF 0 f(F ), and the minimizer is μ t 2 −1 + B − B  uniquely given by F opt = E . 2 F  t t t t+1 The lemmas provide two functions that majorize the data := ProjB B − μ ∇g(B ; C ) , (18) fitting part and the volume-regularization part in (8), respec- where μt ≥ 0 and tively. Specifically, at iteration t and after updating C,wehave  ˆ t { t+1 }L ∇g(Bt ; Ct+1)=Bt Ct+1W t (Ct+1)T + λF t (B , c [] =1). Then, the following holds: log det(BT B + I) ≤ Tr(F t BT B) − log det F t − K, − XWt (Ct+1)T , (14) t t T t −1 is the partial derivative of the cost function in (16) w.r.t. B at where F =((B ) B + I) and the equality holds when t t B , and ProjB(Z) denotes the Euclidean projection of Z on B = B . Similarly, we have B.ForsomeB’s, the projection is easy to compute; e.g., when M L    p B = R ,wehaveProjB(Z) = max{Z, 0}; see other easily 1 2 2 + x[] − Bct+1[] +  implementable projections in [36]. Notice that the update in 2 2 =1 (18) can also easily incorporate extrapolation.

L t   L w 2 The robust volume minimization (RVolMin) algorithm is ≤ x[] − Bct+1[] + φ (wt ), (15) 2 2 p summarized in Algorithm 1. Its convergence properties are =1 =1 stated in Proposition 1, whose proof is relegated to Appendix B. p t p −2 t  − t+1 2 2 t t where w = 2 ( x B c [] 2 + ) and the equality Proposition 1: Assume that L and μ are chosen such that t t t T t t t T t holds when B = B . Putting (14), (15) together and dropping L ≥(B ) B 2 and μ ≥(F ) F 2 , respectively. Also, 6260 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 64, NO. 23, DECEMBER 1, 2016

initialize the algorithms that provide identifiability guarantees can sometimes enhance the performance of the latter.

IV. SIMULATIONS In this section, we provide simulations to showcase the ef- fectiveness of the proposed algorithm. We generate the ele- ments of A ∈ RM ×K from the uniform distribution between zero and one. We generate s[] on the unit simplex and ≤ 1 ≤ ≤ with maxi si [] γ, where K γ 1 is given. We choose γ =0.85, which results in a so-called ‘no-pure-pixel case’ in the context of remote sensing and is known to be challenging to handle; see [3], [4] for details. Zero-mean white Gaussian noise is added to the generated data. To model outliers, we de- assume that B is a convex closed set. Then, if the initial objec- fine the outlier at data point as o[] and let A⊆{1,...,L} be tive value is finite, the whole solution sequence generated by the index set of outliers. We assume that o[]=0 if /∈Aand S Algorithm 1 converges to the set that consists of all the sta- x[]=o[] otherwise. We denote No = |A| as the total num- tionary points of Problem (8), i.e., ber of outliers. Those active outliers are generated following the  uniform distribution between zero and one, and are scaled to sat- lim d(t) Bt , Ct , S =0, t→∞ isfy problem specifications. For the proposed algorithm, we fix p =0.5,  =10−12, and τ =10−8 unless otherwise specified. (t) t t S  − t t 2 where d ((B , C ), ) = minY ∈S Y (B , C ) F . We stop the proposed algorithm when the absolute change of −5 Remark 4: As mentioned before, we may also use different the cost function is smaller than 10 or the number of iterations volume regularizers. Let us consider the volume regularizer in reaches 1000. (9) first. It was shown in [27] that this regularizer can also be We define the signal-to-noise ratio (SNR) as SNR = { 2 } T T E As[] 2 expressed as vol(B)=Tr(GB B), where G = KI − 11 . 10 log10( { 2 } ). Also, to quantify the corruption caused E v[] 2 Therefore, by letting F t = G in Algorithm 1, the updates can by the outliers, we define the signal-to-outlier ratio (SOR) as { 2 } be directly applied to handle the regularizer in (9). Dealing with E As[] 2 SOR = 10 log10( E{o[]2 } ). We use the mean-squared-error (5) is more difficult. One possible way is to make use of (18) 2 (MSE) of A as a measure of factorization performance, defined since det(BT B) is differentiable. The difficulty is that a global as upper bound of the subproblem w.r.t. B may not exist. Under   such circumstances, sufficient decrease at each iteration needs K  2 1  ak aˆπ k  to be guaranteed for establishing convergence to a stationary MSE = min  −  , π∈Π K ak 2 aˆπ 2 point [37]. In practice, the Armijo rule is usually invoked to k=1 k 2 achieve this goal, which in general is computationally more where Π is the set of all permutations of {1, 2,...,K}; and aˆ costly compared to the cases where μt can be determined in k is the estimate of ak . closed form. In this section, we use the SISAL algorithm proposed in [18] Remark 5: Problem (8) is a nonconvex optimization prob- as a baseline. SISAL is a state-of-art robust VolMin algorithm lem. Hence, a good starting point of RVolMin can help reach that takes outliers into account by solving meaningful solutions quickly. In practice, different initializa- min log det(B˜ )+ηCh , tions can be considered: B˜ ,1T C=1T ,{x˜ []=Bc˜ []} • Existing VolMin algorithms. Many VolMin algorithms, such   · L K − as the ones working in the reduced-dimension domain (e.g., the where h = =1 k=1 max( ck [], 0) is an element-wise algorithms in [13], [18]), exhibit good efficiency. The difficulty hinge function. 
The intuition behind SISAL is to penalize the is that these algorithms are usually sensitive to the DR process outliers whose c[] has negative elements, but still allowing in the presence of outliers. Nevertheless, one can employ robust them to exist, thereby having some robustness to outliers. The DR algorithms together with the algorithms in [13], [18] as an tuning parameter η>0 in SISAL controls the amount of outliers initialization approach. Nuclear norm-based algorithms [38] are that are “allowed,” and we test multiple η’s for SISAL in the viable options for robust DR, but are not suitable for large-scale simulations. We run the original SISAL that uses SVD-based problems because of the computational complexity. Under such dimension reduction and the modified SISAL which uses the circumstances, one may adopt simple alternatives such as that robust dimension reduction (RDR) algorithm in [39]. The latter proposed in [39]. is also used to initialize the proposed algorithm. • Nonnegative matrix factorization. If A is known to be non- We first use an illustrative example to show the effective- negative, any NMF algorithm can be employed as initialization. ness of the proposed algorithm in the presence of outliers. In In practice, dealing with NMF is arguably simpler relative to this example, we set SNR = 18 dB, SOR = −10 dB, No = VolMin, and many efficient solvers for NMF existÐsee [40] for 20, (M,K) = (50, 3), and L = 1000. The results are pro- a survey. Although NMF usually does not provide a satisfac- jected onto the affine set that contains conv{a1 , a2 , a3 }, i.e., a tory result on its own in cases where it cannot guarantee the two-dimensional hyperplane. In Fig. 5, we see that SISAL with identifiability of its factors, using the NMF-estimated factors to different η’s cannot yield reasonable estimates of A since the FU et al.: ROBUST VOLUME MINIMIZATION-BASED MATRIX FACTORIZATION FOR REMOTE SENSING AND DOCUMENT CLUSTERING 6261

Fig. 7. MSE of Aˆ obtained by different algorithms vs. SNR. (M,K)= Fig. 5. The Aˆ ’s estimated by various algorithms. Blue points are x[]’s. (50, 5); No =20; SOR = −5dB.

Fig. 6. Objective value vs. iterations, using different update strategies for C. Fig. 8. MSE of proposed algorithm vs. λ. (M,K)=(50, 5); No =20; L = 1000; SOR = −5dB. DR stage threw the data to a badly estimated subspace. Using RDR, SISAL performs better, but is still not satisfactory. In this 20.5 seconds, while using local approximation with extrapola- case, the proposed algorithm yields the most accurate estimate tion costs 2.58 seconds to reach the same objective level. In the of A. upcoming simulations, all the results of the proposed algorithms In Fig. 6, we show the convergence curves of the algorithm are obtained with the extrapolation strategy. under different update rules of C, i.e., ADMM in [1], the pro- In Fig. 7, we show the MSE performance of the proposed posed local approximation, and local approximation with ex- algorithm versus SNR. We fix SOR = −5 dB, and let No =20, trapolation. We show the results averaged from 10 trials, where (M,K) = (50, 5), and L = 1000. In Fig. 7, we see that the SNR = 18 dB and SOR = −5 dB. We see that using ADMM, the original SISAL fails for different η’s. Using RDR, SISAL with objective value converges within 400 iterations. Local approx- η =0.1 yields reasonable results for all the tested SNRs. The imation with extrapolation uses around 800 iterations to attain proposed algorithm with λ =1and λ =0.5 gives the lowest convergence of the objective value, but the objective value can- MSEs. The MSEs given by RVolMin with λ =1are the lowest not converge within 3000 iterations without extrapolation. In when SNR ≤ 20 dB, and RVolMin with λ =0.5 exhibits the terms of runtimes, the local approximation methods uses 0.003 best MSE performance when SNR ≥ 25 dB. The results are second per iteration (a complete update of both C and B), consistent with the intuition behind selecting λ: when the SNR while ADMM costs 0.05 second per iteration. Obviously, local is low, a relatively large λ is needed to enhance the effect of the approximation with extrapolation is the most favorable update VolMin regularization. scheme: its number of iterations for achieving convergence is To understand the effect of selecting λ,weplottheMSEs around twice of that of ADMM, but it is 15 times faster relative of the proposed algorithm versus λ in Fig. 8. We see that there to ADMM for completing an update of C. Specifically, in the exists an (SNR-dependent) optimal choice of λ for achieving case under test, the average time for the algorithm using ADMM the lowest MSE, but also note that any λ in the range considered to update C to achieve the pointed objective value in Fig. 6 is yields satisfactory results in both cases. 6262 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 64, NO. 23, DECEMBER 1, 2016

ˆ Fig. 9. MSE of Aˆ versus K. M =50; No =20; L = 1000; SOR = −5dB. Fig. 11. MSE of A versus No . (M,K)=(50, 5); L = 1000; SOR = −5 dB; SNR = 20 dB.

TABLE I THE MSESOFTHEALGORITHMS UNDER ILL-CONDITIONED A. (M,K)=(50, 5); L = 1000; No =20; SOR = −5 DB.

TABLE II THE MSESOFTHEPROPOSED ALGORITHM WITH DIFFERENT VOL(B)’S. (M,K)=(50, 5); L = 1000; No =20; SNR = 20 DB.

Fig. 10. MSE of Aˆ versus SOR. (M,K)=(50, 5); No =20; L = 1000; SNR = 20 dB.

Fig. 9 shows the MSE performance of the algorithms versus ≤ K.WefixSNR= 20 dB and the other settings are the same with RDR yields reasonable MSEs when No 40, but its per- as in the previous simulation. The results of SISAL and SISAL formance deteriorates when No is larger. with RDR are also used as baselines. We run several η’s for Table I presents the MSEs of the estimated Aˆ under well- and SISAL and present the results of the one with the lowest MSEs. ill-conditioned A’s, respectively. To generate an ill-conditioned As expected, all the algorithms work better when the rank of A, we use a way that is similar to the method suggested the factorization model is lowerÐwhich is consistent with past in [11]: in each trial, we first generate A˜ whose columns experience on different matrix factorization algorithms, such are uniformly distributed between zero and one, and such as [40]. SISAL and SISAL with RDR work reasonably when A˜ ’s are relatively well-conditioned. Then, we apply singular K =3, but deteriorate when K ≥ 6. On the other hand, even value decomposition to obtain A˜ = UΣV T . Finally, we re- when K =15, the proposed algorithm still works well, giving place Σ by Σ˜ = Diag([1, 0.1, 0.01, 0.005, 0.001]) and obtain the lowest MSE. A = UΣ˜ V T . This way, the condition number of the generated Fig. 10 shows the MSEs of the algorithms versus SORs. One A is 103 . The other settings are the same as those in Fig. 7. can see that when some data are badly corrupted, i.e., when One can see from Table I that using such ill-conditioned A,all SOR ≤−10 dB, the proposed algorithm yields significantly the algorithms perform worse compared to the scenario where lower MSEs than SISAL and SISAL with RDR. When SOR ≥ A has uniformly distributed columns (cf. the first and second 0 dB, all three algorithms provide comparable performance. columns in Table I). Nevertheless, the proposed algorithm still We also test the algorithms versus the number of outliers. gives the lowest MSEs. In Fig. 11, one can see that the proposed algorithm is not very In Table II, we present the MSE performance of the proposed sensitive to the change of No : the MSE curve of the proposed al- algorithm using different volume regularizers. We see that us- T gorithm is quite flat for different No ’s in this simulation. SISAL ing vol(B)=Tr(GBB ) has the shortest runtime since the FU et al.: ROBUST VOLUME MINIMIZATION-BASED MATRIX FACTORIZATION FOR REMOTE SENSING AND DOCUMENT CLUSTERING 6263

TABLE III THE MSESOFTHEPROPOSED ALGORITHM WITH AND WITHOUT NONNEGATIVITY CONSTRAINT ON B UNDER VARIOUS SNRS. (M,K)=(50, 5); L = 1000; No =20; SOR = 5 DB.

Fig. 13. The considered subimage of the Moffet data set.

It is interesting to note that using p =0.1 gives slightly worse result compared to using p ∈ [0.25, 0.75]. Our understanding is that using a very small p may lead to numerical problems, { }L since the weights w =1 can be scaled in a very unbalanced way in such cases, resulting in ill-conditioned optimization sub- problems. For the cases where SOR = −5dBand5dB,a similar effect can be seen. In addition, a larger range of p, i.e., p ∈ [0.25, 1.5], can result in good performance when SOR = Fig. 12. MSE of the proposed algorithm with different p’s under various 5 dB. The results suggest a strategy of choosing p: When the SORs. (M,K)=(50, 5); L = 1000; No =20; SNR = 20 dB. data is believed to be badly corrupt, using p around 0.5 is a good choice; and when the data is only moderately corrupted, using subproblem w.r.t. B is convex and can be solved in closed form. ∈ T p [1, 1.5] is preferable, since such a p gives good performance When vol(B) = det(B B), the algorithm requires much more and can better avoid numerical problems. time compared to that of the other two regularizers. This is be- cause the Armijo rule has to be implemented at each iteration. V. R EAL DATA VALIDATION In terms of accuracy, using the log det(BT B) regularizer gives the lowest MSEs when SOR ≤−5dB.Usingdet(BT B) also In this section, we validate the proposed algorithm using two exhibits good MSE performance when SOR ≥ 0dB.Using real data sets, i.e., a hyperspectral image dataset with known Tr(GBBT ) performs slightly worse in terms of MSE, since it outliers and a document dataset. is a coarse approximation to simplex volume. Interestingly, al- though our proposed log-determinant regularizer is not an exact A. Hyperspectral Unmixing measure of simplex volume as the determinant regularizer, it Hyperspectral unmixing (HU) is the application where yields lower MSEs relative to the latter. Our understanding is VolMin-based factorization is most frequently applied; see [3]. that the performance gain results from the ease of computation. As introduced before, HU aims at estimating A, i.e., the spectral Table III presents the MSE of the proposed algorithm with signatures of the materials that are contained in a hyperspectral and without nonnegativity constraint on B, respectively. We image, and also their proportions s[] in each pixel. It is well- see that the MSEs are similar, with those of the nonnegativity- known that there are outliers in hyperspectral images, due to the constrained algorithm being slightly lower. This result validates complicated reflection environment, spectral band contamina- the soundness of our update rule for the constrained case, i.e., tion, and many other reasons [17]. In this experiment, we apply (18). In terms of speed, the unconstrained algorithm requires the proposed algorithm to a subimage of the real hyperspectral less time. We note that the nonnegativity constraint seems to image that was captured over the Moffett Field in 1997 by the only bring marginal performance gain in this simulation. This Airborne Visible/Infrared Imaging Spectrometer (AVIRIS)1;see might be because the data are generated following the model Fig. 13. We remove the water absorption bands from the original in (1) and (2), and under this model VolMin identifiability does 224 spectral bands, resulting in M = 200 bands for each pixel not depend on the nonnegativity of B. However, when we are x[]. 
In this subimage with 50 × 50 pixels, there are three types dealing with real data, adding nonnegativity constraints makes of materialsÐwater, soil, and vegetation. In the areas where dif- much sense, as will be shown in the next section. ferent materials intersect, e.g., the lake shore, there are many Fig. 12 shows the effect of changing p. When SOR = −10 dB, outliers as identified by domain study. Our goal here is to test we see that using p ∈ [0.25, 0.75] gives relatively low MSEs. This is because using a small p is more effective in fending against outliers that largely deviate from the nominal model. 1Online available http://aviris.jpl.nasa.gov/data/image_cube.html. 6264 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 64, NO. 23, DECEMBER 1, 2016

TABLE IV THE CLUSTERING ACCURACY ON REUTERS 21578 CORPUS

[41], [42]. In the last subimage of Fig. 15, we plot 1/w for =1,...,L. Notice that the weight w corresponds to the ‘im- portance’ of the data x[]. The algorithm is designed to auto- matically give small w to outlying pixels. We see that there are Fig. 14. The spectra of the manually selected pure pixels and the estimated a lot of pixels on the lake shore that have very small weights, spectra by the algorithms. indicating that they are outliers. Physically, these outliers corre- spond to those areas where the solar light reflects several times between water and soil, resulting in nonlinearly mixed spectral signatures [41], [42]. The locations of the outlying pixels iden- tified by our algorithm are also consistent with domain study [41], [42].

B. Document Clustering We also present experimental results using the Reuters21578 document corpus2. We use the subset of the full corpus provided by [43], which contains 8213 single-labeled documents from 41 clusters. In our experiment, we test our algorithm under different K (number of clusters), from 3 to 10. Following standard pre- processing, each document is represented as a term-frequency- inverse-document-frequency (tf-idf) vector, and normalized cut weighting is applied; see [43]Ð[45] for details. We apply our VolMin algorithm to factor the document data X to ‘topics’ A Fig. 15. The abundance maps of the materials and the distribution of the and a ‘weight matrix’ S (cf. Fig. 1), and use S to indicate the outliers obtained by the proposed algorithm. cluster labels of the documents. A regularized NMF-based ap- proach, namely, locally consistent concept factorization (LCCF) whether our algorithm can identify the three materials and the [43] is employed as the baseline, which is considered a state- outliers simultaneously. of-the-art algorithm for clustering the Reuters21578 corpus. We apply SISAL, SISAL with RDR, and the proposed algo- For each K, we perform 100 Monte-Carlo trials by randomly rithm to estimate A.Wesetλ =1for our algorithm and tune η selecting K clusters out of the total 41 clusters and 100 docu- B M×K ments from each cluster. We report the performance by compar- for SISAL carefully. Notice that we let = R+ in this case, since the spectral signatures are known to be nonnegative. The ing the results with the ground truth. Performance is measured estimated spectral signatures are shown in Fig. 14. As a bench- by a commonly used metric called clustering accuracy, whose mark, we also present the spectra of some manually selected detailed definition can be found in [43]Ðthe clustering accu- pixels, which are considered purely contributed by only one racy ranges from 0 to 1, and higher accuracies indicate better material, and thus can approximately serve as the ground truth. performances. We see that both SISAL and SISAL with RDR does not yield Table IV presents the results averaged from the 100 trials. accurate estimates of A. Particularly, the spectrum of water is For the proposed algorithm, we set λ =30and p =1.5;we badly estimated by both of these algorithmsÐone can see that also present the result of p =2in this experiment, which we there are many negative values of the spectra of water given by also use to initialize the p<2 case. Note that here we use a SISAL and SISAL with RDR. On the other hand, the proposed larger λ relative to what was used in the simulations. The rea- algorithm with nonnegativity constraint on B gives spectra that son is that the document corpus contains considerable modeling are very similar to those of the pure pixels. errors and the fitting residue is relatively large. The rule of thumb Fig. 15 shows the spatial distributions of the materials (i.e., for selecting λ is to set it at a similar level as the fitting error { }L part; i.e., we let λ = O(δ), where δ = X − Aˆ Sˆ2 and can the abundance maps sk [] =1 for k =1,...,K) that are es- F timated by the proposed algorithm in the first three subimages. 
We see that the three materials are well separated, and their 2Online available: http://www.daviddlewis.com/resources/testcollections/ abundance maps are consistent with previous domain studies reuters21578/ FU et al.: ROBUST VOLUME MINIMIZATION-BASED MATRIX FACTORIZATION FOR REMOTE SENSING AND DOCUMENT CLUSTERING 6265 be coarsely estimated using plain NMF. This way, λ can bal- Combining Eqs. (19) and (20), we have ance the fitting part and the volume regularizer. From Table IV, ⊆C∗ ∩ ∗ we see that VolMin with p =2already yields comparable clus- cone(Q) cone(S) . (21) tering accuracy with LCCF. This indicates that, even without We also know that the extreme rays of cone(Q) lie in the outlier-robustness, modeling the document clustering problem boundary of C∗, i.e., ex{cone(Q)}⊆bdC∗ [9, Lemma 1]. Thus, using VolMin-based SMF is effective. Better accuracies can be we have seen by using p =1.5, where we see that for most K, the pro- { }⊆ C∗ ∩ ∗ posed RVolMin algorithm gives the best accuracy. In particular, ex cone(Q) bd cone(S) . (22) for K =6, 7, more than 4% accuracy improvement can be seen, ∗ ∗ Since we assumed cone(S) ∩ bdC = {λiei |i =1,...,N}, which is considered significant in the context of document clus- we have tering. Interestingly, further decreasing p does not yield better { }⊆{ } performance. This implies that the modeling error is not very ex cone(Q) e1 ,...,eN . (23) severe, but outliers do exist, since using p<2 gives better clus- Therefore, Q can only be a permutation matrix. tering result than using p =2which is not robust to modeling We now show the “⇐” part. Following (22), and knowing errors. that cone(S) is a subset of the convex cone of some permutation VI. CONCLUSION matrix, we see that ∗ ∗ In this work, we looked into theoretical and practical aspects {e1 ,...,eN }⊆cone(S) ∩ bdC . (24) of the VolMin criterion for matrix factorization. On the theory { } side, we showed that two independently developed sufficient Now, suppose that there are a set of vectors r1 ,...,rp that conditions for VolMin identifiability are in fact equivalent. On does not include any unit vectors such that the practical side, we proposed an outlier-robust optimization ∗ ∗ cone(S) ∩ bdC = {e1 ,...,eN , r1 ,...,rp }. surrogate of the VolMin criterion, and devised an inexact BCD ∗ { N ∪ algorithm to deal with it. Extensive simulations showed that Then, we see that we can represent cone(S) = conv R+ the proposed algorithm outperforms a state-of-the-art robust cone{r1 ,...,rp }}. By Property 2, we see that VolMin algorithm, i.e., SISAL. The proposed algorithm was { N ∪ { }}∗ also validated using real-world hyperspectral image data and cone(S)=conv R+ cone r1 ,...,rp document data, where interesting and favorable results were N ∗ = R ∩ cone{r1 ,...,rp } . (25) observed. + N ∗ N Since R ∩ cone{r1 ,...,rp } = cone(S) ⊆ R , (25) APPENDIX A + + leads to A. Proof of Theorem 3 { }∗ ⊆ N cone r1 ,...,rp R+ . To show Theorem 3, several properties of convex cones will By Property 1, we see that be constantly used. They are: N ∗ N ⊆ { } (R+ ) = R+ cone r1 ,...,rp . Property 1: Let K1 and K2 be convex cones. Then, K1 ⊆ K ⇒K∗ ⊆K∗ 2 2 1 . This is a contradiction to the assumption that {r1 ,...,rp } does not include the unit vectors.  Property 2: Let K1 and K2 be convex cones. Then, (K1 ∩ K ∗ {K∗ ∪K∗} 2 ) = conv 1 2 . Now we are ready to prove Theorem 3. We first notice ∗ C⊆ ≥ 1 Property 3: If Q is a unitary matrix. Then, cone(Q) = that cone(S) is equivalent to γ N −1 . 
In fact, we see that C∩S= R( √ 1 ), where S = {s|1T s =1, s ∈ RN }, cone(Q). N −1 and thus the claim holds. Thus, our remaining work is to show We show Theorem 3 step by step. First, we show the following that condition (ii) in Theorem 1 is equivalent to restricting γ lemma: such that γ>√ 1 N −1 Lemma 3: Assume that C⊆cone(S) and Q is any uni- Step 1: Let us consider a conic representation of Theorem 2. tary matrix except the permutation matrices. Then, we Specifically, the corresponding convex cone of R(r)={s ∈ ∗ ∗ have cone(S) ∩ bdC = {λiei |i =1,...,N}⇔cone(S) ⊆ N |{  ≤ }∩ { }} R˜ C ∩ N R s 2 r conv e1 ,...,eN is (r)= (r) R+ , cone(Q). where Proof: We first show the “⇒” part. Given a unitary Q, sup- N T C(r)={s ∈ R |s2 ≤ r1 s}, pose that cone(S) ⊆ cone(Q). and γ = sup{r|C(r) ⊆ cone(S)} under this definition. It is also noticed that C(r) can be re-expressed as By the basic properties of convex cones, we see that   ∗ ∗ ∗  T cone(Q) ⊆ cone(S) , and cone(Q) = cone(Q). Combining,  1 s 1 C(r)= s ≥ √ . we see 12 s2 r N cone(Q) ⊆ cone(S)∗. (19) In words, the vectors whose angles between 1 are less than or equal to arccos √1 comprises C(r). Therefore, by the def- Also, we have r N C⊆cone(S) ⇒ cone(S)∗ ⊆C∗ ⇒ cone(Q) ⊆C∗. (20) inition of dual cone, i.e., C(r)∗ = {s|sT y ≥ 0, y ∈C}, C(r)∗ 6266 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 64, NO. 23, DECEMBER 1, 2016 contains all the vectors that have the angle with 1 less than or t →∞by (13c). Hence, if the proposition holds for the algo- equal to arccos √ r , which leads to rithm without extrapolation, it also holds for the extrapolated r 2 N −1 version asymptotically. r C(r)∗ = C √ . (26) First, let us cast the proposed algorithm into the framework r2 N − 1 of block successive upper bound minimization (BSUM) [29], Step 2: Now, we consider the dual cone representation of [37]. Unlike the classic block coordinate descent algorithm that Theorem 2. Let us define solves every block subproblem exactly [28], BSUM cyclically solves the upper-bound problems of every block subproblems. T C ∪ N (r)=conv( (r) R+ ). We consider the updates using (12) and (16) as an example. The proof of using other updates will follow. Our update rule in (12) Then, following Property 2 and (26), we have and (16) can be equivalently written as R˜ ∗ C ∩ N ∗ t+1 t (r) =( (r) R+ ) (27a) C = arg min uC (C; B ) (30a) 1T C=1T ,C≥0 r = conv C √ ∪ RN , (27b) t+1 t+1 2 + B = arg min uB (B; C ). (30b) r N − 1 B  r t L 1 t p/2 λ = T √ , (27c) where uC (C; B )= =1 2 (2u(c[]; B )+) + 2 log det r2 N − 1 ((Bt )T Bt + τI), u(c[]; Bt ) is defined as before, and

N ∗ N L wherewehaveused(R+ ) = R+ , i.e., Property 3. According    t+1 w  t+1 2 to Property 1, we see that uB (B; C )= x[] − Bc [] 2 2 =1 ∗ r R˜(r) ⊆ cone(S) ⇔ cone(S) ⊆T √ . (28) λ r2 N − 1 + Tr(F t (BT B)) + const, 2 { | ∗ ⊆T }  If we define κ = inf r cone(S) (r) , we see from (28) L t − in which const = =1 φp (w ) K. Note that solving (30a) is and the definition of γ that equivalent to solving (12) over different ’s since the problems 1 w.r.t. =1,...,Lare not coupled. Also denote v(B, C) as the γ>√ ⇔ κ<1. (29) t ≥ t T t  N − 1 objective value of Problem (8). When L (B ) B 2 ,we have Step 3: To show the equivalence between the sufficient condi- v(Bt , C) ≤ u (C; Bt ), ∀C (31a) tions, let us begin from Theorem 1. The condition C⊆cone(S) C C∗ ⊆ ∗ t+1 t+1 means that cone(S) by Property (1), and it further im- v(B, C ) ≤ uB (B; C ), ∀B, (31b) plies that {e ,...,e }⊆ex{cone(S)∗} [9]. Suppose the rest 1 N t t T t ∗ where (31a) holds because under L ≥(B ) B 2 we have of cone(S) ’s extreme rays are r1 ,...,rp , and t is defined as t = inf{r|cone{r ,...,r }⊆C(r)}. Then, we have 1 p t 1 − t 2 ≤ t ∀ f(c[]; B )= x[] B c[] 2 u(c[]; B ), c[], ∗ 2 cone(S) = cone{r1 ,...,rp , e1 ,...,eN } and thus N L ⊆ {C ∪ } T   p conv (t) R+ = (t). 1 v(Bt , C)= 2f(c[]; Bt )+ 2 2 Hence, by the definitions of t and κ,wehavet = κ. =1 Given the above analysis, we first show that Theorem 1 im- λ plies Theorem 2. Now, assume that Condition (ii) in Theo- + log det((Bt )T Bt + τI) ∗ ∗ 2 rem 1 is satisfied. We see that cone(S) ∩ bdC = {λi ei |i = 1,...,N} by Lemma 3. Then, r ,...,r are in the interior of L 1 p 1 t p C∗ C ≤ (2u(c[]; B )+) 2 = (1). Therefore, we have t<1, and subsequently κ<1 2 and γ>√ 1 strictly. =1 N −1 Now we show the converse by contradiction. Assume that λ + log det((Bt )T Bt + τI) γ>√ 1 holds (the condition in Theorem 2 is satisfied). If at N −1 2 ∗ least one point in r1 ,...,rp touches the boundary of C (con- t = uC (C; B ); dition (ii) in Theorem 1 is not satisfied), then t =1, and we cannot decrease it further while still contain cone(S)∗ in T (t)∗. Eq. (31b) holds because of Lemmas 1 and 2; also note that the √ 1 equalities hold when C = Ct and B = Bt , respectively. Since This means that κ =1, or, equivalently, γ = − , which con- N 1 all the functions above are continuously differentiable, we also tradicts our first assumption that γ>√ 1 . N −1 have t t t t ∇C v(B , C )=∇C uC (C ; B ) (32a) B. Proof of Proposition 1 ∇ v(Bt , Ct+1)=∇ u (Bt ; Ct+1). (32b) In the following, we prove the proposition under the algo- B B B rithmic structure without extrapolation. For the case where we Note that if we update C using ADMM, we have v(Bt , C)= t t t update C using (13), it is easy to see that y [] → c [] given uC (C; B ) for all CÐEqs. (31a) and (32a) still hold. Also, if FU et al.: ROBUST VOLUME MINIMIZATION-BASED MATRIX FACTORIZATION FOR REMOTE SENSING AND DOCUMENT CLUSTERING 6267

we update B by (18), the conditions in (31b) and (32b) are also [5] X. Fu, W.-K. Ma, K. Huang, and N. D. Sidiropoulos, “Blind separation satisfied when μt ≥(F t )T F t  since we now we have of quasi-stationary sources: Exploiting convex geometry in covariance 2 domain,” IEEE Trans. Signal Process., vol. 63, no. 9, pp. 2306Ð2320, u (B; Ct+1)=g(Bt ; Ct+1)+∇g(Bt ; Ct+1)T (B − Bt ) May 2015. B [6] X. Fu, N. D. Sidiropoulos, and W.-K. Ma, “Tensor-based power spectra separation and emitter localization for cognitive radio,” in Proc. IEEE μt  − t 2 + B B F , 8th Sensor Array Multichannel Signal Process. Workshop, 2014, pp. 421Ð 2 424. and it can be shown that v(B, Ct+1) ≤ g(B; Ct+1) ≤ [7] D. Donoho and V.Stodden, “When does non-negative matrix factorization t+1 give a correct decomposition into parts?” in Proc. 17th Annu. Conf. Neural uB (B; C ) and Inf. Process. Syst., 2003, vol. 16, pp. 1141Ð1148. t t+1 t t+1 t t+1 [8] H. Laurberg, M. G. Christensen, M. D. Plumbley, L. K. Hansen, and ∇Bv(B , C )=∇Bg(B ; C )=∇BuB (B ; C ), S. Jensen, “Theorems on positive data: On the uniqueness of NMF,” t Comput. Intell. Neurosci., vol. 2008, 2008, Art. ID. 764206. and the equalities hold simultaneously at B = B . Eqs. (31), [9] K. Huang, N. D. Sidiropoulos, and A. Swami, “Non-negative matrix fac- (32) satisfy the sufficient conditions for a generic BSUM algo- torization revisited: New uniqueness results and algorithms,” IEEE Trans. Signal Process., vol. 62, no. 1, pp. 211Ð224, Jan. 2014. rithm to converge (cf. Assumption 2 in [37]). In addition, by [10] C.-H. Lin, W.-K. Ma, W.-C. Li, C.-Y.Chi, and A. Ambikapathi, “Identifia- t t [37, Theorem 2 (b)], if we can show that (B , C ) lives in a bility of the simplex volume minimization criterion for blind hyperspectral compact set for all t, we can prove Proposition 1. unmixing: The no-pure-pixel case,” IEEE Trans. Geosci. Remote Sens., t t vol. 53, no. 10, pp. 5530Ð5546, Oct. 2015. Next, we show that in every iteration, B and C are bounded. [11] N. Gillis and S. Vavasis, “Fast and robust recursive algorithms for sepa- The boundness of Ct is evident because we enforce feasibility rable nonnegative matrix factorization,” IEEE Trans. Pattern Anal. Mach. at each iteration. To show that Bt is bounded, we first note that Intell., vol. 36, no. 4, pp. 698Ð714, Apr. 2014. [12] J. Li and J. M. Bioucas-Dias, “Minimum volume simplex analysis: A fast v(Bt , Ct ) ≥ v(Bt+1, Ct+1), algorithm to unmix hyperspectral data,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., 2008, vol. 3, pp. 250Ð253. where the inequality holds since the non-increasing property of [13] T.-H. Chan, C.-Y. Chi, Y.-M. Huang, and W.-K. Ma, “A convex analysis- 0 0 based minimum-volume enclosing simplex algorithm for hyperspectral the BSUM framework [37]. By the assumption that v(B , C ) unmixing,” IEEE Trans. Signal Process., vol. 57, no. 11, pp. 4418Ð4432, is bounded, i.e., Nov. 2009. 0 0 [14] A. Ambikapathi, T.-H. Chan, W.-K. Ma, and C.-Y. Chi, “Chance- v(B , C ) ≤ V, constrained robust minimum-volume enclosing simplex algorithm for hy- perspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 11, where V<∞,wehave pp. 4194Ð4209, Nov. 2011. L   p [15] L. Miao and H. Qi, “Endmember extraction from highly mixed data using 1 2 λ x[] − Bc[]2 +  + log det(BT B + τI) ≤ V minimum volume constrained nonnegative matrix factorization,” IEEE 2 2 2 Trans. Geosci. Remote Sens., vol. 45, no. 3, pp. 765Ð777, Mar. 2007. =1 [16] G. Zhou, S. Xie, Z. Yang, J.-M. Yang, and Z. 
Next, we show that $\mathbf{B}^t$ and $\mathbf{C}^t$ are bounded in every iteration. The boundedness of $\mathbf{C}^t$ is evident because we enforce feasibility at each iteration. To show that $\mathbf{B}^t$ is bounded, we first note that
$$v(\mathbf{B}^t, \mathbf{C}^t) \geq v(\mathbf{B}^{t+1}, \mathbf{C}^{t+1}),$$
where the inequality holds because of the non-increasing property of the BSUM framework [37]. By the assumption that $v(\mathbf{B}^0, \mathbf{C}^0)$ is bounded, i.e., $v(\mathbf{B}^0, \mathbf{C}^0) \leq V$ where $V < \infty$, we have
$$\sum_{\ell=1}^{L} \left( \left\| \mathbf{x}[\ell] - \mathbf{B} \mathbf{c}[\ell] \right\|_2^2 + \epsilon \right)^{p/2} + \frac{\lambda}{2} \log\det\!\left(\mathbf{B}^T \mathbf{B} + \tau \mathbf{I}\right) \leq V$$
for every $(\mathbf{B}, \mathbf{C}) \in \{\mathbf{B}^t, \mathbf{C}^t\}_{t=1,\ldots}$ generated by the algorithm. Since the first term on the left-hand side of the above inequality is nonnegative, we have
$$\log\det\!\left(\mathbf{B}^T \mathbf{B} + \tau \mathbf{I}\right) \leq \frac{2V}{\lambda} \;\Longleftrightarrow\; \sum_{i=1}^{N} \log(\sigma_i^2 + \tau) \leq \frac{2V}{\lambda}$$
$$\Longrightarrow\; \log(\sigma_i^2 + \tau) \leq \frac{2V}{\lambda} - (N-1)\log\tau, \quad \forall i, \tag{33a}$$
$$\Longrightarrow\; \sigma_i^2 \leq \exp\!\left(\frac{2V}{\lambda} - (N-1)\log\tau\right) - \tau, \quad \forall i, \tag{33b}$$
where $\sigma_1, \ldots, \sigma_N$ denote the singular values of $\mathbf{B}$, and (33a) holds since $\log(\sigma_i^2 + \tau) \geq \log\tau$ for all $i$. The right-hand side of (33b) is bounded, which implies that every singular value of $\mathbf{B}^t$ for $t = 1, 2, \ldots$ is bounded. Since the singular values of $\mathbf{B}^t$ are uniformly bounded and the constraint sets are closed, we conclude that the sequence $\{\mathbf{B}^t, \mathbf{C}^t\}$ lies in a compact set. Now, invoking [37, Theorem 2(b)], the proof is complete.
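The bound in (33) is also easy to verify numerically. The sketch below (our illustration; Vbar stands for the value $2V/\lambda$ attained by $\log\det(\mathbf{B}^T\mathbf{B} + \tau\mathbf{I})$) draws random matrices and checks (33b):

```python
import numpy as np

# Illustrative check of (33): if log det(B^T B + tau*I) <= Vbar, then every
# squared singular value of B obeys sigma_i^2 <= exp(Vbar - (N-1)*log(tau)) - tau.
rng = np.random.default_rng(1)
M, N, tau = 10, 4, 0.1
for _ in range(1000):
    B = rng.standard_normal((M, N))
    sig2 = np.linalg.svd(B, compute_uv=False) ** 2
    Vbar = np.sum(np.log(sig2 + tau))        # equals log det(B^T B + tau*I)
    bound = np.exp(Vbar - (N - 1) * np.log(tau)) - tau
    assert np.all(sig2 <= bound + 1e-9)      # (33b) holds for every singular value
```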
ACKNOWLEDGMENT

The authors would like to thank Prof. N. Dobigeon for providing the subimage of the Moffett data.

REFERENCES

[1] X. Fu, W.-K. Ma, K. Huang, and N. D. Sidiropoulos, “Robust volume minimization-based structured matrix factorization via alternating optimization,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Mar. 2016, pp. 2534–2538.
[2] D. Lee and H. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[3] W.-K. Ma et al., “A signal processing perspective on hyperspectral unmixing,” IEEE Signal Process. Mag., vol. 31, no. 1, pp. 67–81, Jan. 2014.
[4] N. Gillis, “The why and how of nonnegative matrix factorization,” in Regularization, Optimization, Kernels, and Support Vector Machines, vol. 12. London, U.K.: Chapman & Hall, 2014, pp. 257–291.
[5] X. Fu, W.-K. Ma, K. Huang, and N. D. Sidiropoulos, “Blind separation of quasi-stationary sources: Exploiting convex geometry in covariance domain,” IEEE Trans. Signal Process., vol. 63, no. 9, pp. 2306–2320, May 2015.
[6] X. Fu, N. D. Sidiropoulos, and W.-K. Ma, “Tensor-based power spectra separation and emitter localization for cognitive radio,” in Proc. IEEE 8th Sensor Array Multichannel Signal Process. Workshop, 2014, pp. 421–424.
[7] D. Donoho and V. Stodden, “When does non-negative matrix factorization give a correct decomposition into parts?” in Proc. 17th Annu. Conf. Neural Inf. Process. Syst., 2003, vol. 16, pp. 1141–1148.
[8] H. Laurberg, M. G. Christensen, M. D. Plumbley, L. K. Hansen, and S. Jensen, “Theorems on positive data: On the uniqueness of NMF,” Comput. Intell. Neurosci., vol. 2008, 2008, Art. ID 764206.
[9] K. Huang, N. D. Sidiropoulos, and A. Swami, “Non-negative matrix factorization revisited: New uniqueness results and algorithms,” IEEE Trans. Signal Process., vol. 62, no. 1, pp. 211–224, Jan. 2014.
[10] C.-H. Lin, W.-K. Ma, W.-C. Li, C.-Y. Chi, and A. Ambikapathi, “Identifiability of the simplex volume minimization criterion for blind hyperspectral unmixing: The no-pure-pixel case,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 10, pp. 5530–5546, Oct. 2015.
[11] N. Gillis and S. Vavasis, “Fast and robust recursive algorithms for separable nonnegative matrix factorization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 4, pp. 698–714, Apr. 2014.
[12] J. Li and J. M. Bioucas-Dias, “Minimum volume simplex analysis: A fast algorithm to unmix hyperspectral data,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., 2008, vol. 3, pp. 250–253.
[13] T.-H. Chan, C.-Y. Chi, Y.-M. Huang, and W.-K. Ma, “A convex analysis-based minimum-volume enclosing simplex algorithm for hyperspectral unmixing,” IEEE Trans. Signal Process., vol. 57, no. 11, pp. 4418–4432, Nov. 2009.
[14] A. Ambikapathi, T.-H. Chan, W.-K. Ma, and C.-Y. Chi, “Chance-constrained robust minimum-volume enclosing simplex algorithm for hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 11, pp. 4194–4209, Nov. 2011.
[15] L. Miao and H. Qi, “Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 3, pp. 765–777, Mar. 2007.
[16] G. Zhou, S. Xie, Z. Yang, J.-M. Yang, and Z. He, “Minimum-volume-constrained nonnegative matrix factorization: Enhanced ability of learning parts,” IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1626–1637, Oct. 2011.
[17] N. Dobigeon, J.-Y. Tourneret, C. Richard, J. Bermudez, S. McLaughlin, and A. O. Hero, “Nonlinear unmixing of hyperspectral images: Models and algorithms,” IEEE Signal Process. Mag., vol. 31, no. 1, pp. 82–94, Jan. 2014.
[18] J. M. Bioucas-Dias, “A variable splitting augmented Lagrangian approach to linear spectral unmixing,” in Proc. 1st Workshop Hyperspectral Image Signal Process.: Evol. Remote Sens., 2009, pp. 1–4.
[19] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[20] X. Fu, W.-K. Ma, T.-H. Chan, and J. M. Bioucas-Dias, “Self-dictionary sparse regression for hyperspectral unmixing: Greedy pursuit and pure pixel search are related,” IEEE J. Sel. Topics Signal Process., vol. 9, no. 6, pp. 1128–1141, Sep. 2015.
[21] P. Gritzmann, V. Klee, and D. Larman, “Largest j-simplices in n-polytopes,” Discr. Comput. Geom., vol. 13, no. 1, pp. 477–515, 1995.
[22] M. Fazel, H. Hindi, and S. P. Boyd, “Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices,” in Proc. Amer. Control Conf., 2003, vol. 3, pp. 2156–2162.
[23] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 171–184, Jan. 2013.
[24] Y.-F. Liu, S. Ma, Y.-H. Dai, and S. Zhang, “A smoothing SQP framework for a class of composite L_q minimization over polyhedron,” Math. Program., pp. 1–34, 2015.
[25] H. Ekblom, “Lp-methods for robust regression,” BIT Numer. Math., vol. 14, no. 1, pp. 22–32, 1974.
[26] S. A. Vorobyov, Y. Rong, N. D. Sidiropoulos, and A. B. Gershman, “Robust iterative fitting of multilinear models,” IEEE Trans. Signal Process., vol. 53, no. 8, pp. 2678–2689, Aug. 2005.
[27] M. Berman, H. Kiiveri, R. Lagerstrom, A. Ernst, R. Dunne, and J. Huntington, “ICE: A statistical approach to identifying endmembers in hyperspectral images,” IEEE Trans. Geosci. Remote Sens., vol. 42, no. 10, pp. 2085–2095, Oct. 2004.
[28] D. P. Bertsekas, Nonlinear Programming. Belmont, MA, USA: Athena Scientific, 1999.
[29] M. Hong, M. Razaviyayn, Z.-Q. Luo, and J.-S. Pang, “A unified algorithmic framework for block-structured optimization involving big data: With applications in machine learning and signal processing,” IEEE Signal Process. Mag., vol. 33, no. 1, pp. 57–77, Jan. 2016.
[30] W. Wang and M. A. Carreira-Perpiñán, “Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application,” 2013, arXiv preprint arXiv:1309.1541v1.
[31] Y. Nesterov, Introductory Lectures on Convex Optimization, vol. 87. New York, NY, USA: Springer, 2004.
[32] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.
[33] Y. Xu and W. Yin, “A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion,” SIAM J. Imag. Sci., vol. 6, no. 3, pp. 1758–1789, 2013.
[34] X. Fu, K. Huang, W.-K. Ma, N. D. Sidiropoulos, and R. Bro, “Joint tensor factorization and outlying slab suppression with applications,” IEEE Trans. Signal Process., vol. 63, no. 23, pp. 6315–6328, Dec. 2015.
[35] J. Jose, N. Prasad, M. Khojastepour, and S. Rangarajan, “On robust weighted-sum rate maximization in MIMO interference networks,” in Proc. IEEE Int. Conf. Commun., 2011, pp. 1–6.
[36] N. Parikh and S. Boyd, “Proximal algorithms,” Found. Trends Optim., vol. 1, no. 3, pp. 123–231, 2013.
[37] M. Razaviyayn, M. Hong, and Z.-Q. Luo, “A unified convergence analysis of block successive minimization methods for nonsmooth optimization,” SIAM J. Optim., vol. 23, no. 2, pp. 1126–1153, 2013.
[38] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” J. ACM, vol. 58, no. 3, p. 11, 2011.
[39] F. Nie, J. Yuan, and H. Huang, “Optimal mean robust principal component analysis,” in Proc. 31st Int. Conf. Mach. Learn., 2014, pp. 1062–1070.
[40] K. Huang and N. D. Sidiropoulos, “Putting nonnegative matrix factorization to the test: A tutorial derivation of pertinent Cramér–Rao bounds and performance benchmarking,” IEEE Signal Process. Mag., vol. 31, no. 3, pp. 76–86, May 2014.
[41] C. Févotte and N. Dobigeon, “Nonlinear hyperspectral unmixing with robust nonnegative matrix factorization,” IEEE Trans. Image Process., vol. 24, no. 12, pp. 4810–4819, Dec. 2015.
[42] A. Halimi, Y. Altmann, N. Dobigeon, and J.-Y. Tourneret, “Nonlinear unmixing of hyperspectral images using a generalized bilinear model,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 11, pp. 4153–4162, Nov. 2011.
[43] D. Cai, X. He, and J. Han, “Locally consistent concept factorization for document clustering,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 6, pp. 902–913, Jun. 2011.
[44] W. Xu and Y. Gong, “Document clustering by concept factorization,” in Proc. 27th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2004, pp. 202–209.
[45] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, vol. 1. Cambridge, U.K.: Cambridge Univ. Press, 2008.

Xiao Fu (S'12–M'15) received the B.Eng. and M.Eng. degrees in communication and information engineering from the University of Electronic Science and Technology of China, Chengdu, China, in 2005 and 2010, respectively, and the Ph.D. degree in electronic engineering from The Chinese University of Hong Kong (CUHK), Shatin, N.T., Hong Kong, in 2014. From 2005 to 2006, he was an Assistant Engineer at China Telecom Co. Ltd., Shenzhen, China. He is currently a Postdoctoral Associate at the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA. His research interests include signal processing and machine learning, with a recent emphasis on factor analysis and its applications. He received the Best Student Paper Award at the IEEE International Conference on Acoustics, Speech and Signal Processing 2014, and was a finalist of the Best Student Paper Competition at the IEEE Sensor Array and Multichannel Signal Processing Workshop 2014. He also co-authored a paper that received the Best Student Paper Award at the IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing 2015.

Kejun Huang (S'13) received the bachelor's degree from Nanjing University of Information Science and Technology, Nanjing, China, in 2010, and the Ph.D. degree in electrical engineering from the University of Minnesota, Minneapolis, MN, USA, in 2016. His research interests include signal processing, machine learning, and optimization.

Bo Yang (S'15) received the B.Eng. and M.Eng. degrees in communication and information system from the Huazhong University of Science and Technology, Wuhan, China, in 2011 and 2014, respectively. Since 2014, he has been working toward the Ph.D. degree in the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN, USA. His research interests include data analytics and machine learning, with a recent emphasis on factor analysis. He received the Best Student Paper Award at the International Workshop on Computational Advances in Multi-Sensor Adaptive Processing 2015.

Wing-Kin Ma (M'01–SM'11) received the B.Eng. degree in electrical and electronic engineering from the University of Portsmouth, Portsmouth, U.K., in 1995, and the M.Phil. and Ph.D. degrees, both in electronic engineering, from The Chinese University of Hong Kong (CUHK), Shatin, N.T., Hong Kong, in 1997 and 2001, respectively. He is currently an Associate Professor with the Department of Electronic Engineering, CUHK. From 2005 to 2007, he was also an Assistant Professor with the Institute of Communications Engineering, National Tsing Hua University, Taiwan, R.O.C. Prior to becoming a faculty member, he held various research positions with McMaster University, Canada; CUHK; and the University of Melbourne, Melbourne, Vic., Australia. His research interests include signal processing, communications, and optimization, with a recent emphasis on MIMO transceiver designs and interference management, blind separation and structured matrix factorization, and hyperspectral unmixing in remote sensing. He currently serves as a Senior Area Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING and an Associate Editor of Signal Processing. He has previously served as an Associate Editor and a Guest Editor of several journals, including the IEEE TRANSACTIONS ON SIGNAL PROCESSING, IEEE SIGNAL PROCESSING LETTERS, IEEE JOURNAL OF SELECTED AREAS IN COMMUNICATIONS, and IEEE SIGNAL PROCESSING MAGAZINE. He was a tutorial speaker at the European Signal Processing Conference 2011 and the International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2014. He is currently a Member of the Signal Processing Theory and Methods Technical Committee (SPTM-TC) and the Signal Processing for Communications and Networking Technical Committee (SPCOM-TC). He has received several awards, including ICASSP Best Student Paper Awards in 2011 and 2014 (by his students), the WHISPERS 2011 Best Paper Award, the Research Excellence Award 2013–2014 by CUHK, and the 2015 IEEE SIGNAL PROCESSING MAGAZINE Best Paper Award.

Nicholas D. Sidiropoulos (F'09) received the Diploma degree in electrical engineering from the Aristotelian University of Thessaloniki, Thessaloniki, Greece, and the M.S. and Ph.D. degrees in electrical engineering from the University of Maryland, College Park, MD, USA, in 1988, 1990, and 1992, respectively. He served as an Assistant Professor at the University of Virginia (1997–1999); an Associate Professor at the University of Minnesota, Minneapolis, MN, USA (2000–2002); a Professor at the Technical University of Crete, Greece (2002–2011); and a Professor at the University of Minnesota (2011–present). His current research interests include signal and tensor analytics, with applications in cognitive radio, big data, and preference measurement. He received the NSF/CAREER Award (1998), the IEEE Signal Processing Society (SPS) Best Paper Award (2001, 2007, 2011), and the IEEE SPS Meritorious Service Award (2010). He has served as an IEEE SPS Distinguished Lecturer (2008–2009) and as Chair of the IEEE Signal Processing for Communications and Networking Technical Committee (2007–2008). He received the Distinguished Alumni Award of the Department of Electrical and Computer Engineering, University of Maryland, College Park, in 2013, and was elected EURASIP Fellow in 2014.