<<

arXiv:2106.12190v1 [stat.ML] 23 Jun 2021 in fmatrices of iesoa subspace dimensional emtto arx and matrix, permutation where a eepesdas expressed be can iemdl .. ti sue htdt matrix data that colum [10 assumed the [6], is on [21], it focuses i.e., paper [11], This model, [18], [23]. wise affected [40], [38], the are [39], [37], data In [13], not the corruption [24], [4]. are of data [1], columns the elements data of by subset the corrupted a of the problem, column/row matrix second and any sparse corrupted in elements plus concentrated are the rank of data subset low random of as a that known assumes corrupti decomposition, problem, problems data the first PCA for robust The models different different addition, man two in In to two interest corresponding corrupted. are great is of There data is applications. the components of outlying is by portion the found output locating small subspa subspace underlying a its true the if the that dimension, even from deviate sense intrinsic arbitrarily the can low PCA in has to usef data sensitive is PCA the While reduction. when dimensionality linear for used erh nuevsdLann,Lvrg Leverage Learning, Unsupervised Search, upromms fteeitn algorithms. i existing closed-form, and the the fast of against is most in approach outperform algorithms studies presented theoretical the the and while that of numerical noise.The robustness results of t the presence theoretical I for inliers. demonstrate of The models distribution we the method. different and outliers PCA with of distribution robust guarantees and new method performance PCA include a robust th based design several Score to Leverage establish a to for Val utilized guarantees Innovation Scores Innovation is Leverage that to connection the equivalent a proved are interesting function under is study cost it new algorithm We the and Search with points. function Innovation cost data the quadratic me outl by the to for computed for utilized of Values Search were used innovation innovation Innovation of recently the of directions application was the detection, the clustering, In data detection. for proposed rnia opnn nlss(C)hsbe extensively been has (PCA) Analysis Component Principal ne Terms Index Abstract U .. the i.e., , A lsdFr,Poal,adRbs C via PCA Robust and Provable, Closed-Form, Teie fInvto erh hc a initially was which Search, Innovation of idea —The ∈ eeaeSaitc n noainSearch Innovation and Statistics Leverage A R Rbs C,OtirDtcin Innovation Detection, Outlier PCA, —Robust n M i and 1 oun of columns × n .I I. D i B U , h oun of columns The . h oun of columns The . ([ = B NTRODUCTION [ A B ∈ A B ] A R ersnsteconcatenation the represents M ]) r h nir n the and inliers the are 1 × T 00 E8hS.Bleu,W 80,USA 98004, WA Bellevue, St. 8th NE 10900 n , o , B T { ontleentirely lie not do amn.u,pingli98 rahmani.sut, A sa unknown an is otf amn n igLi Ping and Rahmani Mostafa D i nan in lie ontv optn Lab Computing Cognitive ∈ addition, n R eoretical M ad Research Baidu dicate 1 This . asure can t × ues on. M n ier he ce n- r ul ], y o 2 - stecmlmn of complement the as akof rank U etr,and vectors, ignlvle r qa otennzr iglrvle of values singular non-zero the to equal D are values diagonal etsnua vectors, singular left Matrix ai o h niedata, entire the for basis vector a P o matrix a For ehdi netmt for estimate an is method of columns eoe t rbnu om and norm, Frobenius its denotes • pl multiplication. 
decomposition value matrix singular one one include method a only existing they uses present the while The outperform which mostly presented. method, methods is closed-form PCA similarity, of robust measure new symmetric a Search, novation estab to • used guarantees. is theoretical connection several interesting This function. cost ssuidadi ssonta h loih a provably can algorithm Notation noisy. the B. is data that are the shown when inliers is outliers and the it distinguish outliers and n of studied the presence the is of to robustness the distribution Furthermore, presented. the for models qiaett eeaeSoei udai otfnto is function the cost of instead quadratic directions optimal a the if find to Score used Leverage to equivalent follows. as summarized The • be methods. can existing contributions the main outperform PCA mostly robust provable which and methods closed-form performance thorough present with We supported guarantees. not are complexi them computational of most high and with each iterations of number Contributions the of on Summary points A. data the projecting by of located complement be can outliers h osof rows the , hoeia efrac urnesudrsvrldiffere several under guarantees performance Theoretical nprdb h xlnto fLvrg crswt In- with Scores Leverage of explanation the by Inspired ie matrix a Given oto h xsigrbs C loihsrqiealarge a require algorithms PCA robust existing the of Most ti rvdta noainVleitoue n[2 is [32] in introduced Value Innovation that proved is It i ∈ } k @gmail.com a R i k M D D 2 a 1 . and , , × = S k r B M a r sdfie sabssfor basis a as defined is U k 1 A d − r p r h ules h upto outPCA robust a of output The outliers. the are U ′ V , 1 Σ stern of rank the is stern of rank the is eoe its denotes . a V niae h unit the indicates A ∈ i , Σ eoe its denotes R where U k r A ∈ . d U × k R U M r sabssfrteinliers, the for basis a is If . eoe t pcrlnorm, spectral its denotes ℓ d 2 U p × A nr and -norm r qa otergtsingular right the to equal are ′ r U A d h subspace The . D ∈ T i sadaoa arxwhose matrix diagonal a is setmtdacrtl,the accurately, estimated is h rhnra matrix orthonormal The . th R stetasoeof transpose the is ℓ M ounand column 2 nr peein sphere -norm 1 U × oethat Note . a r d ( i ) stemti of matrix the is its ℓ U 1 nr based -norm ⊥ i th k sdefined is A r element. U d k A k 1 ′ sthe is R A , For . 2 oise sa is M lish k ed us = nt ty 1 F s 1 . 2

II. RELATED WORK with iSearch. In the rest of this section, Coherence Pursuit and iSearch are reviewed in more details. Robust PCA is a well-known problem and many approaches were developed for this problem. In this section, we briefly Coherence Pursuit (CoP): CoP [28] assigns a value, termed Coherence Value, to each data point and is recovered using review some of the previous works on robust PCA. Some of U the earliest approaches to robust PCA are based on robust the span of the data points with the highest Coherence Values. estimation of the data , such as the mini- The Coherence Value corresponding to data point di represents d mum covariance determinant, the minimum volume ellipsoid, the similarity between i and the rest of the data points. and the Stahel-Donoho estimator [14], [10], [36]. However, CoP uses inner-product to measure the similarity between data these methods mostly compute a full SVD or eigenvalue points and it distinguishes the outliers based on the fact that decomposition in each iteration and their performance greatly an inlier bears more resemblance to the rest of the data than an outlier. degrades when ni < 0.5. In addition, they lack performance no guarantees with structured outliers. Another approach is to Innovation Search (iSearch) [32]: Innovation Pursuit was replace the Frobenius Norm in the cost function of PCA with initially proposed as a data clustering algorithm [31], [30]. ℓ1-norm [20], [16], as ℓ1-norm was shown to be robust to The authors of [31], [30] showed that Innovation Pursuit can the presence of the outliers [3], [16]. In order to leverage notably outperform the self-representation based clustering the column-wise structure of the outliers, the authors of [7] methods (e.g. Sparse Subspace Clustering [8]) specifically replaced the ℓ1-norm minimization problem used in [16] with when the clusters are close to each other. Innovation Pursuit an ℓ1,2-norm minimization problem. In [19] and [40], the computes an optimal direction corresponding to each data optimization problem used in [7] was relaxed to two different point di which can be written as the optimal point of convex optimization problems. The authors of [19], [40] provided sufficient conditions under which the optimal points T T min c D subject to c di =1 . (1) of the convex optimization problems proposed in [19], [40] c k k1 are guaranteed to yield an exact basis for . The approach U presented in [35] focused on the scenario in which the data is If C∗ RM1×M2 contains all the optimal directions, Inno- predominantly unstructured outliers and the number of outliers vation Pursuit∈ builds the adjacency matrix as Q + QT where is larger than M1. In [35], it is essential to assume that the Q = DT C∗ . In [32], it was shown that the optimal direc- outliers are randomly distributed on SM1−1 and the inliers tions can| be| utilized for outlier detection too. The approach are distributed randomly on the intersection of SM1−1 and proposed in [32], termed iSearch, assigns an Innovation Value . The outlier detection method proposed in [34] assumes to each data point and it distinguishes the outliers as the data U 1 that the outliers are randomly distributed on SM −1 and points with the higher Innovation Values. The Innovation Value a small number of them are not linearly dependent which assigned to di is computed as means that [34] is not able to detect the linearly dependent outliers and the outliers which are close to each other. 
Another 1 approach is based on decomposing the given D into a low DT c∗ rank matrix and a column sparse matrix where the column k i k1 sparse matrix models the presence of the outliers [5], [37]. c∗ However, this approach requires the number of outliers to be where i is the optimal point of (1). iSearch needs to run an significantly smaller than the number of the outliers and the iterative solver to find the optimal directions. In contrast, the solver algorithms that are used to decompose the data need to methods presented in this paper are closed-form and they can compute a SVD of the data in each iteration. be hundreds of time faster. The main shortcomings of the previous methods are sensi- Leverage Statistics: In regression, Leverage Scores are de- tivity to structured outliers and the lack of comprehensive the- fined as the diagonal values of the hat matrix XT (XXT )−1X oretical guarantees. The Coherence Pursuit method, proposed where X Rm1×m2 is the [9]. Assuming that ∈ th in [28], was shown (theoretically and numerically) to be robust the rank of X is equal to m1, then the i leverage score is 2 Rm1×m2 to different types of outliers. However, Coherence Pursuit can equal to vxi 2 where the rows of Vx are equal to k k ∈ th miss outliers which carry weak innovation with respect to the the right singular vectors of X and vxi is the i column of Vx. inliers. The iSearch algorithm, proposed in [32], was shown Leverage has been typically used in the regression framework to notably outperform Coherence Pursuit in detecting outliers and there are few works which focused on using it for the with weak innovation . Similar to Coherence Pursuit, the robust PCA problem [25], [26]. For instance, [25] utilized robustness of iSearch against different types of outliers were leverage to reject the outlying time points in an functional supported with several theoretical guarantees. However, in magnetic resonance images (fMRI) run. However, there is not contrast to Coherence Pursuit which is a closed-form method, still an analysis and full understating of Leverage in the robust iSearch needs to run an iterative and computationally expen- PCA setting. We show that Innovation Value introduced in [32] sive solver to find the directions of innovation. This paper is equivalent to Leverage Score if a quadratic cost function is presents robust PCA methods which have the advantages of used to find the optimal directions. This interesting connection both (CoP and iSearch) algorithms: while they are closed-form is used to establish several theoretical guarantees and to design algorithms, their ability in distinguishing outliers are on a par a new robust PCA method. 3

Algorithm 1 Asymmetric Normalized Coherence Pursuit Algorithm 2 Symmetric Normalized Coherence Pursuit (ANCP) for Robust PCA (SNCP) for Robust PCA Input. The inputs are the data matrix D RM1×M2 and r Input. The inputs are the data matrix D RM1×M2 and r which is the dimension of the recovered subspace.∈ which is the dimension of the recovered subspace.∈

1. Normalize the ℓ2-norm of the columns of D, i.e., set di 1. Similar to Step 1 in Algorithm 1. equal to di/ di for all 1 i M . Rrd×M2 k k2 ≤ ≤ 2 2. Define V as in Algorithm 1. Rrd×M2 ∈ 2. Rows of V are equal to the first rd right singular 3. The vector of Normalized Coherence Values is defined as vectors of D where∈ is the rank of D. rd M2 (v T v )2 3. Define x RM2 , vector of Normalized Coherence Values, i j x(i)= 2 2 . ∈ vi vj as j=1 2 2 X k k k k 1 4. Construct matrix Y as in Algorithm 1. x(i)= 2 . (2) vi k k2 Output: The column-space of Y is the identified subspace. 4. Construct matrix Y from the columns of D correspond- ing to the largest elements of x such that they span an r- dimensional subspace. A. Explaining Leverage Score for Robust PCA Using Innova- Output: The column-space of Y is the identified subspace. tion Search Algorithm 1 ranks the data points based on the inverse of their leverage scores. The following lemma shows that III. ROBUST PCA WITH LEVERAGE AND INNOVATION Leverage Score is directly related to Innovation Value. SEARCH Lemma 1. Suppose rows of V Rrd×M2 are equal to the first r right singular vectors of D∈ where r is the rank of D. The table of Algorithm 1 and the table of Algorithm 2 d d Define c∗ as the optimal point of show the presented methods along with the used definitions. i Algorithm 1 utilizes Leverage Scores to rank the data points min cT D subject to cT d =1 . (3) c 2 i and the data points with the minimum Leverage Scores are k k T ∗ 2 1 used to build a basis for . In this section, we show the Then, D c = 2 . i 2 kvik2 underlying connection betweenU Algorithm 1 and iSearch and k k the motivation for naming Algorithm 1 “Asymmetric Normal- Lemma 1 indicates that if a quadratic cost function is used to ized Coherence Pursuit” is explained. In addition, we explain compute the optimal directions in iSearch, Innovation Values the motivations behind the design of Algorithm 2 based are equivalent to Leverage Scores. Accordingly, we can use the on the connection between Algorithm 1 and iSearch. Both idea of Innovation Search to explain the Leverage Score based algorithms are closed-form and they are faster than most of robust PCA method. First suppose that di is an outlier which ⊥ the existing methods. The computation complexity of ANCP means that di has a non-zero projection on . Since most of the data points are inliers, the optimization problemU utilizes the is (rdM1M2) and the computation complexity of SNCP is ⊥ ⊥ O 2 projection of d in and finds the optimal direction near (rdM1M2 + rdM ). O 2 to minimize AT cU 2. In sharp contrast, when d is an inlier,U The presented robust PCA algorithms use the data points i 2 i the linear constraintk k strongly discourages the optimal direction corresponding to the largest Normalized Coherence Values ⊥ T ∗ 2 to be close to . Thus, A ci 2 is notably larger when di to form the basis matrix Y. If the inliers are distributed U k T ∗ k is an inlier comparing to A ci 2 when di is an outlier. Ac- uniformly at random in , then r data points corresponding 1 k T ∗k 2 T ∗ 2 T ∗ 2 U cordingly, since v 2 = D ci 2 = A ci 2 + B ci 2, to the r largest Normalized Coherence Values span with k ik2 k k k k k k U 2 high probability. 
However, in real data, the inliers form some 1/ vi 2 is much larger when di is an inlier comparing to the k k T ∗ 2 same value when di is an outlier because A c is much clustering structure and the algorithm should continue adding k i k2 new columns to Y until the columns of Y span an r- larger when di is an inlier. The following Lemma indicates that DT c∗ 2 can be dimensional subspace. It means that we need to check the k i k2 singular values of Y multiple times. Two techniques can written as the sum of the similarities between the columns of V Rrd×M2 . be utilized to avoid these extra steps [32], [29]. The first ∈ approach is based on leveraging side information that we ∗ Lemma 2. Define ci and V as in Lemma 1. Then, mostly have about the population of the outliers. In most of the M2 T 2 applications, we can have an upper-bound on no/M2 because T ∗ 2 vi vj D ci 2 = . (4) outliers are mostly associated with rare events. If we know that k k vi vi j=1 2 2 the number of outliers is less than y % of the data, matrix Y X k k k k  can be constructed using (1 y) % of the data columns which are corresponding to the largest− Normalized Coherence Values. Thus, Algorithm 1 is inherently similar to CoP but Algo- The second technique is to use the adaptive column sampling rithm 1 utilizes the coherency between the columns of V. method proposed in [29] which uses subspace projection to In other word, the functionality of Algorithm 1 is similar to avoid sampling redundant columns. that of a CoP algorithm which is applied to a data matrix 4 whose non-zero singular values are normalized to 1. This is and different techniques are required to guarantee (5) in each the motivation to name the presented algorithms Normalized case, but a similar strategy is used in the proofs of all the T Coherence Pursuit. results. In contrast to Cop which analyzed di dj i,j based on the distribution of the data, it is not{| straightforward|} to T B. Symmetric Normalized Coherence Pursuit directly bound vi vj i,j. In addition, in contrast to iSearch which leveraged{| the fact|} that the optimal direction of (1) is The measure of similarity used in (4) is not symmetric. mostly orthogonal to when di is an outlier, the optimal In other word, the similarity between di and dj which is U vT v 2 direction obtained by (3) is not necessarily orthogonal to i j computed as v 2 v 2 is not equal to the similarity when di is an outlier. In the proofs of the presented k ik k ik U vT v 2   i j results, we utilized the geometry of the problem in which between dj and di which can be written as v v . k j k2k j k2 the optimal direction of (3) is not close to when di is Accordingly, we modify the measure of similarity used in (4) U   an outlier. Specifically, corresponding to outlier di, we define T into a symmetric measure of similarity. The Normalized Co- ⊥ RR di T ⊥ d = T 2 and by definition d d = 1. According to i kd Rk2 i i herence Value corresponding to di using the symmetric mea- i ∗ vT v 2 the definition of ci as the optimal point of (3) and according M2 i j sure of similarity is defined as x(i)= v v . 1 T ⊥ 2 j=1 k ik2k j k2 to Lemma 1, kv k2 D di 2. We utilized this inequality Algorithm 2 uses the symmetric measure of similarity to i 2 ≤ k k P   to derive the sufficient conditions which prove that (5) holds compute the Normalized Coherence Values. The numerical with high probability. The detailed proofs of all the results are experiments show that utilizing the symmetric function can provided in the appendix. 
notably improve the performance in most of the cases. A. Outliers Distributed on SM1−1 IV. THEORETICAL STUDIES In most of the previous works on the robust PCA problem, In this section, we present analytical performance guaran- the performance of the outlier detection method is analyzed tees for Normalize Coherence Pursuit under different models under the assumption that the outliers are randomly distributed for the distribution of the outliers. The connection between on SM1−1. This is a simple scenario because the outliers Innovation Values and Leverage Scores is utilized to analyze are unstructured and the projection of each outlier on is the ANCP method and we leave the analysis of SNCP to future not strong with high probability given that r is sufficientlyU works. In the following subsections, the performance guar- small because when the outliers are randomly distributed as antees with unstructured outliers, linearly dependant outliers, in Assumption 1, then E[ UT b 2] = r ( b is an outlying noisy inliers, and clustered outliers is provided. Moreover, in 2 M data point). The followingk assumptionk specifies the presumed contrast to most of the previous methods whose guarantees are model for the distribution of the inliers/outliers. limited to randomly distributed inliers, Normalized Coherence Pursuit is supported with theoretical guarantees even when Assumption 1. The columns of A are drawn uniformly at the inliers are clustered. The following sections provide the random from SM1−1 and the columns of B are drawn theoretical results and each theorem is followed by a short uniformly at randomU ∩ from SM1−1. discussion which highlights the important aspects of that The following theorem provides the sufficient conditions to theorem. guarantee the exact recovery of . To simplify the exposition and notation, in the presented U results, it is assumed without loss of generality that T is equal Theorem 3. Suppose D follows Assumption 1. If x is defined to the identity matrix, i.e, D = [B A]. The subspace is as in (2) and recovered using the span of the data points with the largestU ni 4 2r ni 2r (ψ 1)no Normalized Coherence Values. A sufficient condition which max log , 4 log > − + guarantees exact recovery of is that the minimum of the r − 3 δ r δ ! M1 r (6) Normalized Coherence ValuesU corresponding to the inliers is 8 2M no 2M larger than the maximum of the Normalized Coherence Values max log 1 , 16 log 1 , 3 δ M δ corresponding to the outliers, i.e., r 1 ! then (5) holds and is recovered exactly with probability at min x(i) M2 > max ( x(i) no ) . (5) U { }i=no+1 { }i=1 least 1 3δ.   − This is not a necessary condition but it is easier to guarantee. Since the outliers are randomly distributed, the expected In addition, we define bT R 2 M1−r value of i 2 is equal to M1 which is nearly equal to 1 no k k 1 when r/M is small [28], [27]. Thus, Theorem 3 roughly indi- ψ = max 1 T 2 cates that if ni/r is sufficiently larger than no/M1, Normalized bi R 2 i=1  k k   Coherence Values can successfully distinguish the outliers. It where R is an orthonormal basis for ⊥ and b is the ith i is important to note that ni is scaled with r while no is scaled column of B. The parameter ψ indicatesU how close the outliers with M1. It means that if r is sufficiently small and if the are to . outliers are unstructured, can be recovered exactly even if no U U Proof Strategy: Although the presented theorems consider is much larger than ni. 
In addition, one can observe that when different scenarios for the distribution of the inliers/outliers the outliers are unstructured, the requirements of Normalized 5

Coherence Pursuit are similar with those of CoP [28]. In the right hand side of the sufficient condition of Theorem 5. The- next subsection, we observe a clear difference between their orem 5 predicts that when o is close , the CoP Algorithm requirements when the outliers are close to . is more likely to fail. ThisU is a correctU prediction because U when and o are close, the inliers and the outliers are U U B. Outliers in an Outlying Subspace close to each other and their inner-product values are large. In addition, by comparing the sufficient conditions of Normalized Although Assumption 1 is a popular data model in the Coherence Pursuit with that of iSearch [32] with linearly literature of robust PCA, it is not a realistic assumption in dependant outliers, we can observe that the nature of the the practical scenarios. In practice, outliers can be structured sufficient conditions are similar. In the presented experiments, and they are not completely independent from each other as it it is shown that Normalized Coherence Pursuit is on a par with is assumed in Assumption 1. For instance, in anomaly event iSearch in identifying outliers with weak innovation while it detection, the outlying video frames are highly correlated or is a closed-from algorithm and its running time is much faster. in misclassified data points identification, they can belong to the same cluster [12]. In this section, we study the robustness C. Noisy Inliers against linearly dependant outliers. The following assumption specifies the presumed model for the outliers. Although exact recovery of is not feasible when the inliers are noisy but the NormalizedU Coherence Values can Assumption 2. Define subspace o with dimension ro such distinguish the outliers even in the strong presence of noise. U that o / and / o. The columns of A are randomly In this section , we present a theorem which guarantees that U ∈ U SUM∈1− U1 distributed on and the columns of B are randomly (5) holds with high probability if ni/r and Signal to Noise U ∩ SM1−1 distributed on o . Ratio (SNR) are sufficiently large. The following assumption U ∩ The following theorem provides the sufficient condition to specifies the presumed model. guarantee that (5) holds. Assumption 3. The matrix D can be expressed as Theorem 4. Suppose D follows Assumption 2. If x is defined 1 D = [B (A + E)] T . 2 as in (2) and 1+ σn

M1×ni ni 4 2r ni 2r The matrix E R p represents the presence of noise max log , 4 log > and it can be written∈ as E = σ N where the columns of r − 3 δ r r δ ! n N RM1×ni are drawn uniformly at random from SM1−1 ψno 4 2ro no 2ro ∈ UT U⊥ + ψ max log , log and σn is a positive number which controls the power of the k o k r 3 δ r δ  o  r o  added noise. then (5) is satisfied and is recovered exactly with probability M2 Before we state the theorem, let us define vectors ti at least 1 2δ. U { }i=1 where ti =Σvi and the diagonal matrix Σ contains the non- − ′ Theorem 4 roughly states that if ni/r is sufficiently larger zero singular values of D. Note that di = U ti and ti 2 = −2t Mk2 k than no/ro, the exact recovery is guaranteed with high prob- kΣ ik2 di 2. In addition, define tmin = mini T −2 t Σ ti k k i i=no+1 ability. If ro is comparable to r, then the number of inliers   −2t M2 n o should be sufficiently larger than the number of outliers. This kΣ ik2 and tmax = maxi T −2 . t Σ ti confirms our intuition about the outliers because if r is i i=no+1 o n o  comparable to r and no is also large, we cannot label the Theorem 6. Suppose A and B follow Assumption 1 and D columns of B as outliers. It is informative to compare the is formed according to Assumption 3. If x is defined as in (2) requirements of Normalized Coherence Pursuit with that of and CoP. The following theorem provides the sufficient conditions 2 2 ( 1+ σn tmaxσn) to guarantee that CoP successfully distinguishes the outliers. − 2 p 1+ σn Theorem 5. Suppose D follows Assumption 2. If ni 4 2r ni 2r 2 max log , 4 log > 2σnnitmax+ ni 4 2r ni 2r r − 3 δ r δ !! max log , 4 log > r r − 3 δ r δ r ! no 4 2M no 2M ψ + max log 1 , 4 log 1 + M 3 δ M δ T 2 ni 4 2r ni 2r 1 r 1 !! U Uo + max log , 4 log + k k r 3 δ r δ 2 r !! σ ψ ni 4 2M ni 2M n + max log 1 , 4 log 1 n 4 2r n 2r 1+ σ2 M 3 δ M δ o + max log o , 4 o log o , n 1 r 1 !! r 3 δ r δ o  r o  then (5) holds with probability at least 1 3δ. then the CoP method proposed in [28] recovers exactly with − In this section, we considered the unstructured outliers probability at least 1 3δ. U − whose number can be much larger than M1 and ni. Consider One can observe that the requirement of Theorem 5 is much the challenging scenario that the unstructured outliers domi- stronger than that of Theorem 4 because ni/r appears on the nate the data, thus the values of all the singular values are close 6

to each other which indicates that the values of tmin and tmax TA is an arbitrary permutation matrix. The columns of Ak are are close to one. Thus, the2 sufficient condition of Theorem 6 drawn uniformly at random from the intersection of subspace (1−σn) ni M1−1 2 and S where is a -dimensional subspace. In other roughly states that 1+σ r should be sufficiently larger than k k d n n U U m n σ + o . In practise, the algorithm works better than what word, the columns of A lie in a union of subspaces k i n M1 {U }k=1 the sufficient condition implies because the proof is based on and ( 1 ... m) = where denotes the direct sum U ⊕ ⊕U U ⊕ considering the worst case scenarios. operator. The following theorem provides the sufficient conditions to D. Clustered Outliers guarantee that the computed Normalized Coherence Values In this section, we consider a different structure for the satisfy (5) with high probability. outliers. It is assumed that the outliers form a cluster outside of Theorem 8. Suppose that the distribution of the inliers follows the span of the inliers. Structured outliers are mostly associated Assumption 5 and the distribution of outliers follows Assump- with important rare events such as malignant tissues [15] tion 1. If x is defined as in (2) and or web attacks [17]. The following assumption specifies the presumed model for B. n 4 2M n 2M ϑ > (ψ 1) o +2max log 1 , 4 o log 1 A − M1 3 δ M1 δ Assumption 4. Each column of B is formed as bi = r ! 1 (q + ηfi). The unit ℓ2-norm vector q does not lie in , 1+η2 √ U m T 2 nik no SM1−1 where ϑ = inf k=1 a Uk 2 and = mini d fi i=1 are drawn uniformly at random from , and η a∈U k k A − { } kak=1 ( is a positive number. P m 4 2md nik 2md In Assumption 4, the outliers form a cluster around vector q max 3 log δ , 4 d log δ , then (5) is satisfied   )i=1 which does not lie in and η determines how close they are and is recoveredq exactly with probability at least 1 3δ. to each other. The followingU theorem provides the sufficient U − conditions to guarantee that Normalized Coherence Values Theorem 8 reveals an interesting property of the Normalized distinguish the cluster of outliers. Coherence Values. According to the definition of , is m A A roughly equal to min nik k=1/d. Thus, Theorem 8 states that Theorem 7. Suppose that the distribution of inliers follows when the inliers are{ clustered,} the population of the cluster Assumption 1 and the distribution of outliers follows Assump- with the minimum population is the key factor. This property tion 4. If x is defined as in (2) and matches with our intuition about outlier detection because T ⊥ 2 if there is a cluster with few number of data points, we ni 4 2r ni 2r ψ q U 2 max log , 4 log >no k k + r − 3 δ r δ 1+ η2 could label them as outliers similar to the outliers modeled r ! in Assumption 2. The parameter ϑ = inf m aT U 2 a k=1 k 2 2 2 ∈U k k ψη no η ψ 4 2M1 no 2M1 kak=1 + max log , log + shows how well the inliers are diffused in P. Clearly, if the (1 + η2)M 1+ η2 3 δ M δ 1 r 1 ! inliers are present in all or most of the directionsU inside , a U 1 robust PCA algorithm is more likely to recover correctly. η√ψ n 2no log qT U⊥ o δ U 2 2 +2√no + , However, the presented methods do not require the inliers to 1+ η k k √M s M1 1  1 − occupy all the directions in . The reason that ϑ appeared U then (5) is satisfied and is recovered exactly with probability in the sufficient conditions is that the theorem guarantees the at least 1 4δ. U performance in the worst case scenarios. 
− In sharp contrast to Theorem 3, no is not scaled with M1. V. NUMERICAL EXPERIMENTS This means that when η is small (the outliers are close to In this section, SNCP and ANCP are compared with the ex- each other), ni/r should be sufficiently larger than no. When isting robust PCA approaches, including FMS [18], GMS [40], η goes to infinity, the distribution of outliers converges to the CoP [28], iSearch [32], and R1-PCA [7], and their robustness distribution of outliers in Assumption 1 and one can observe against different types of outliers is examined with both real that the sufficient condition in Theorem 7 converges to the and synthetic data. sufficient condition in Theorem 3. Remark 1. In the presented theoretical results, it was assumed that is known. When the data is noisy, one can utilize E. Outlier Detection in a Union of Subspaces rd any rank estimation algorithm and the performance of the In practice, the inliers are not randomly distributed in a algorithms is not sensitive to the chosen rank as long as rd subspace and they mostly form some structures. In this section, is sufficiently larger than r. In the presented experiments, we we assume that the inliers are clustered. It is assumed that set rd equal to the number of singular values of D which are the columns of A form m clusters and the data points in greater than s1/20 where s1 is the first singular value of D. each cluster span a d-dimensional subspace. The following assumption provides the details. A. Comparing Different Scores Assumption 5. The matrix of inliers can be written as A = In this experiment, we simulate a scenario in which the M1×ni m [A ... Am]TA where Ak R k , ni = ni, and outliers are close to . Suppose r =8, ni = 180, and no = 40. 1 ∈ k=1 k U P 7

SNCP ANCP ˆ ˆ T ˆ 1 1 such that f(k)= (I UU )dk 2 where U is the identified subspace. A trialk is considered− successfulk if 0.8 0.8

0.6 0.6 max f(k): k>no < min f(k): k no , (7) { } { ≤ } 0.4 0.4    

0.2 0.2 which means that the norm of projection of all the inliers on ˆ⊥ should be smaller than the corresponding values for the 0 0 2 1 / Normalized-Coherence-Value 1 / Normalized-Coherence-Value A 0 50 100 150 200 0 50 100 150 200 U k kF Element Index Element Index outliers. Define SNR = E 2 where E is the noise component k kF iSearch CoP 1 1 which is added to the inliers. Fig. 2 shows the probability that (7) is valid versus SNR (the number of evaluation runs 0.8 0.8 was 200). In the left plot, the distribution of inliers follows 0.6 0.6 Assumption 1 and in the right plot it follows Assumption 6 0.4 0.4 with γ = 0.2. One can observe that SNCP outperforms most

Innovation Value 0.2 0.2 1 / Coherence-Value of the existing methods on both cases and the performance of

0 0 iSearch and ANCP are close. In addition, by comparing the 0 50 100 150 200 0 50 100 150 200 Element Index Element Index two plots, it can be observed that the performance of some of the robust PCA methods is sensitive to the distribution Fig. 1. The plots show the inverse of Normalized Coherence Values computed by SNCP and ANCP, the Innovation Values, and the inverse of Coherence of the inliers. For instance, FMS outperforms most of the Values. In this experiment, the first 40 columns are the outliers. other methods when the inliers are randomly distributed but its performance degrades significantly when the inliers forma cluster in . The outliers are generated as [U H] G where H RM1×4 U spans a random 4-dimensional subspace and the∈ elements 12×20 1 1 of G R are sampled independently from (0, 1). SNCP ∈ N 0.8 SNCP 0.8 ANCP Fig. 1 demonstrates Innovation Values, Coherence Values, and ANCP FMS 0.6 FMS 0.6 CoP Normalized Coherence Values computed by ANCP and SNCP CoP iSearch GMS (the first 40 columns are outliers). In this figure, we show the 0.4 iSearch 0.4 GMS R1-PCA PCA R1-PCA inverse of Coherence Values and the inverse of Normalized 0.2 0.2 Probability of Probability of PCA exact outlier detection Coherence Values to make them comparable to Innovation exact outlier detection 0 0 Values. One can observe that the scores computed by iSearch, 0 10 20 30 0 10 20 30 ANCP, and SNCP can be reliably used to form a basis for SNR SNR but the scores computed by CoP do not distinguish the outlierUs Fig. 2. The outliers are linearly dependant and they lie in a 10-dimensional subspace. In the left plot, the inliers are randomly distributed in U (Assump- well enough. As it was predicted by Theorem 5, CoP can fail tion 1) and in the right plot, the inliers form a cluster (Assumption 6 with to distinguish the outliers when they are close to . The main γ = 0.2). reason is that CoP measures the similarity betweenU the data points via a simple inner-product while iSearch and ANCP utilize the directions of innovation to measure the similarity C. Identifying the Permuted Columns between a data point and the rest of data. The functionality of The problem of regression with unknown permutation is SNCP is similar to that of ANCP while it uses a symmetric similar to the conventional regression but the correspondence measure of similarity and the plots show that it distinguishes between input variables and labels is missing or erroneous. the outliers in a more clear way. Suppose X Rd×n is the measurement matrix where n is the number∈ of measurements. Define Y Rm×n as B. Noisy Data the observation matrix which can be written as∈ Y = ΘX In this section, we examine the robustness of the robust where Θ Rm×d is the unknown matrix which is estimated PCA methods against noise. Suppose B follows Assumption 2, by the regression∈ algorithm. In the regression problem with r =5, ro = 10, M1 = 200, ni = 100, and no = 100 where o unknown permutation, the observation matrix Y is affected U is a random 10-dimensional subspace. We consider two models by an unknown permutation matrix Π, i.e., matrix Y can be for the distribution of the inliers. The first model is random written as Y =ΘXΠ where Π Rn×n. In this problem, it is distribution on SN−1 as described in Assumption 1. In the assumed that Π does not displace∈ all the columns of ΘX and U ∩ second model, it is assumed that the inliers form a cluster in only an unknown fraction of the columns are displaced. The . The following assumption describes the second model. 
authors of [33] showed that this special regression problem U can be translated into a robust PCA problem. Define matrix Assumption 6. Each column of matrix A is formed as ai = (d+m)×n T T T Usi r Z R as Z = ([X Y ]) , i.e., each column of where si = w + γzi, w R is a unit ℓ -norm vector, kUsik2 2 ∈ ni ∈ r−1 Z is equal to the concatenation of the corresponding columns and zi are sampled randomly from S . { }i=1 of X and Y. Suppose n > d and assume that the rank of X Since the data is noisy, exact subspace recovery is not fea- is equal to d. If Π is equal to the identity matrix, the rank sible. Instead, we examine the probability that an algorithm of Z is equal to d. In contrast, when the columns of Y are distinguishes all the outliers correctly. Define vector f RM2 displaced, the corresponding columns of Z do not lie in the ∈ 8

d-dimensional subspace which the other columns of Z lie in. SNCP ANCP Therefore, the columns of Z which are corresponding to the displaced columns can be considered as outliers and a robust 30 30 i i

PCA method can be utilized to locate them. Once they are n n located and removed, the regression problem can be solved 20 20 using the remaining measurements. 10 10 In this experiment, the elements of X and Θ are sampled 500 1000 1500 2000 500 1000 1500 2000 from (0, 1), d = 10, and m = 10. Define ni as the number n n o o of columnsN of Y which are not affected by the permutation Fig. 4. The phase transitions in presence of the unstructured outliers versus matrix and define no as the number of displaced columns. The ni and no. White indicates correct subspace recovery and black designates robust PCA methods are applied to Z to find a basis for the incorrect recovery. In this experiment, the data follows Assumption 1 with d-dimensional subspace which is spanned by the columns of M1 = 50 and r = 4. Z corresponding to the inliers. If this subspace is estimated ac- curately, all the displaced columns can be exactly located [33]. T k(I−UU )Uˆ kF robust PCA method is applied to each identified cluster to Define Log-Recovery Error as log U , where 10 k kF find the miss-classified data points as outliers. We refer the Uˆ is an orthonormal basis for the recovered subspace. Fig. 3 reader to [12], [28] for further details. Similar to corresponding shows Log-Recovery Error versus no where ni is fixed equal experiment in [28], we use Hopkins155 dataset which makes to 200 (the number of evaluation runs was 400). This is a the outlier detection problem challenging. In this dataset, the challenging subspace recovery task because the outliers can data points are linearly dependent and the clusters are close be close to the span of inliers and this is the main reason that to each other. Therefore, the outliers are structured and they CoP did not perform well. One can observe that SNCP and are close to the inliers. The clustering error of the clustering FMS yielded the best performance. algorithm is 30% and we compute the final clustering error after applying the robust PCA methods and updating the

0 clusters. Table I shows the clustering error after applying

SNCP different robust PCA methods. One can observe that iSearch, ANCP ANCP, SNCP, and CoP yielded better performance and the -5 FMS CoP main reason is that they leverage the clustering structure of iSearch the inliers and they are robust against structured outliers. -10 GMS R1-PCA PCA

Log-Recovery Error TABLE I -15 CLUSTERINGERRORAFTERUSINGTHEROBUST PCA METHODS TO 100 200 300 400 DETECT THE MISCLASSIFIED DATA POINTS. n o CoP FMS R1-PCA GMS iSearch PCA ANCP SNCP Fig. 3. This plot shows subspace recovery error versus the number of dis- 6.93 28.5 22.56 17.25 3.72 12.01 6.64 3.65 placed measurements. The number of measurements which are not displaced are equal to ni = 200 and the total number of measurements are equal to ni + no. F. Event Detection in Video In this experiment, we utilize the robust PCA methods to D. Unstructured Outliers identify an activity in a video files, i.e., the outlier detection Theorem 3 predicted that when the outliers are randomly methods identify the frames which contain the activity as the distributed, the number of outliers can be much larger than outlying frames/data-points. We use the Waving Tree video the number of inliers provided that ni/r is sufficiently large. file [22] where in this video a tree is smoothly waving and Suppose the data follows Assumption 1 with M1 = 50 and in the middle of the video, a person crosses the frame. The r = 4. Define Uˆ as an orthonormal basis for the recovered frames which only contain the background (the tree and the I UUT Uˆ subspace. A trial is considered successful if k( − ) kF < environment) are inliers and the few frames corresponding to kUkF 10−3 . Fig. 4 shows the phase transitions in which white the event (the presence of the person) are the outliers. The means correct subspace recover and black designates incorrect tree is smoothly waving and we use r = 3 as the rank of recovery (the number of evaluation runs was 20). The phase inliers for all the methods. We use 100 frames where frames transitions indicate that when ni/r is larger than 5, the 65 to 80 are the outlying frames. In this interval (65 to 80), the person enters the frame from left, stay in the middle, algorithms can successfully recover even if no = 2000. In addition, SNCP shows more robustnessU against the outliers and leaves from right. In this experiment, we vectorize each D when ni is small. frame and form data matrix by the vectorized frames. In addition, we reduce the dimensionality of D by projecting each column of D into the span of the first 50 left singular vectors. E. Structured Outlier Detection in Real Data Thus, D R50×100. Define Uˆ as the estimated subspace and ∈ The authors of [12], [28] proposed to use robust PCA define the residual value corresponding to data point di as T to improve the accuracy of the clustering algorithms. The di Uˆ Uˆ di . The outliers are detected as the data points k − k2 9

ANCP SNCP SVD R1-PCA iSearch FMS 0.4 0.4 0.4 0.4 0.4 0.3

0.3 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 0.2 0.2

0.1 0.1 0.1 0.1 0.1 0.1 Residual Value Residual Value Residual Value Residual Value Residual Value Residual Value

0 0 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 Column Index Column Index Column Index Column Index Column Index Column Index Fig. 5. This figure shows the residual values computed by different methods. Each element represents a frame of the video file and frames 65 to 80 are the outlying frames. with the larger residual values. Fig. 5 shows the residual values distribution of the outliers and the distribution of the inliers computed by different methods. An important observation is were presented. In addition, it was shown with both theoretical that FMS, PCA (SVD), and R1-PCA clearly distinguished the and numerical investigations that the algorithms are robust to first and the last outlying frames but they hardly distinguish the the strong presence of noise. Although the presented methods middle outliers (frames 69 to 74). The main reason is that in are fast closed-form algorithms, it was shown that they often these frames the person does not move which means that these outperform most of the existing methods. outlying frames are very similar to each other. Fig. 5 shows that the ANCP, SNCP, and iSearch successfully distinguish all the outlying frames since they are robust to structured and linearly dependant outliers. REFERENCES

G. Running Time [1] Emmanuel J Cand`es, Xiaodong Li, Yi Ma, and John Wright. Robust In this section, we study the running time of the robust PCA principal component analysis? Journal of the ACM, 58(3):11, 2011. [2] Emmanuel J Cand`es and Benjamin Recht. Exact matrix completion methods. For ANCP, SNCP, CoP, and iSearch, we used 50 data via convex optimization. Foundations of Computational mathematics, points to build the basis matrix (matrix Y). Table II shows the 9(6):717–772, 2009. running times versus M2 while M1 = 200. Table III shows the [3] Emmanuel J Candes and Terence Tao. Decoding by linear programming. running times versus while . In all the runs, IEEE Transactions on Information Theory, 51(12):4203–4215, 2005. M1 M2 = 1500 [4] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A Parrilo, and Alan S r = 5 and ni = 200. One can observe that CoP, SNCP, and Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM ANCP are notably fast since they are single step algorithms. Journal on Optimization, 21(2):572–596, 2011. The running time of CoP and SNCP is longer than ANCP [5] Yeshwanth Cherapanamjeri, Prateek Jain, and Praneeth Netrapalli. Thresholding based outlier robust pca. In Conference on Learning when M2 is large because their computation complexity scale Theory, pages 593–628. PMLR, 2017. 2 with M2 . GMS is also a fast algorithm when M1 is small but [6] Vartan Choulakian. L1-norm projection pursuit principal component analysis. Computational Statistics & Data Analysis, 50(6):1441–1451, its running time can be long when M1 is large because its 3 2006. computation complexity scale with M1 . [7] Chris Ding, Ding Zhou, Xiaofeng He, and Hongyuan Zha. R1- PCA: rotational invariant L1-norm principal component analysis for Proceedings of the 23rd International TABLE II robust subspace factorization. In Conference on Machine Learning (ICML), pages 281–288, Pittsburgh, RUNNINGTIMEOFTHEALGORITHMSVERSUS M2 (M1 = 200). PA, 2006. [8] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, M2 SNCP ANCP iSearch CoP FMS GMS theory, and applications. IEEE Transactions on Pattern Analysis and 500 0.0228 0.0120 0.2660 0.016 0.1130 0.0872 Machine Intelligence, 35(11):2765–2781, 2013. 1000 0.0427 0.0160 0.8983 0.0325 0.2440 0.1265 [9] Brian Everitt and Anders Skrondal. The Cambridge dictionary of 5000 0.2622 0.0428 19.4080 0.3930 0.6635 0.2926 statistics, volume 106. Cambridge University Press Cambridge, 2002. [10] Jiashi Feng, Huan Xu, and Shuicheng Yan. Robust PCA in high- dimension: A deterministic approach. In Proceedings of the 29th TABLE III International Conference on Machine Learning (ICML), Edinburgh, UK, 2012. RUNNINGTIMEOFTHEALGORITHMSVERSUS M1 (M2 = 1500). [11] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and M1 SNCP ANCP iSearch CoP FMS GMS automated cartography. Communications of the ACM, 24(6):381–395, 200 0.0614 0.0187 1.7279 0.0576 0.2978 0.1458 1981. 500 0.1456 0.0727 2.1261 0.0710 0.9574 0.6399 [12] Andrew Gitlin, Biaoshuai Tao, Laura Balzano, and John Lipor. Improv- 1000 0.3145 0.2527 2.7695 0.0900 2.7731 2.5590 ing k-subspaces via coherence pursuit. IEEE Journal of Selected Topics in Signal Processing, 2018. [13] Moritz Hardt and Ankur Moitra. Algorithms and hardness for robust The 26th Annual Conference on Learning Theory VI. CONCLUSION subspace recovery. In (COLT), pages 354–375, Princeton, NJ, 2013. It was shown that Innovation Value under the quadratic [14] Peter J Huber. . 
Springer, 2011. cost function is equivalent to Leverage Score. Two closed- [15] Seppo Karrila, Julian Hock Ean Lee, and Greg Tucker-Kellogg. A comparison of methods for data-driven cancer outlier discovery, and an form robust PCA methods were presented where the first application scheme to semisupervised predictive biomarker discovery. one was based on Leverage Score and the second one was Cancer Informatics, 10:109, 2011. inspired by the connection between Leverage Score and In- [16] Qifa Ke and Takeo Kanade. Robust L1 norm factorization in the pres- ence of outliers and missing data by alternative convex programming. novation Value. Several theoretical performance guarantees In IEEE Computer Society Conference on Computer Vision and Pattern for the robust PCA method under different models for the Recognition (CVPR), pages 739–746, San Diego, CA, 2005. 10

[17] Christopher Kruegel and Giovanni Vigna. Anomaly detection of web- VII. PROOFS based attacks. In Proceedings of the 10th ACM Conference on Computer and Communications Security (CCS), pages 251–261, Washington, DC, In this section, the proofs for the presented theoretical 2003. results are provided. In most of the proofs, we frequently use [18] Gilad Lerman and Tyler Maunu. Fast, robust and non-convex subspace the following useful lemmas. recovery. Information and Inference: A Journal of the IMA, 7(2):277– 336, 2017. Lemma 9. [28] Suppose g1, ..., gn are i.i.d. random vectors [19] Gilad Lerman, Michael B McCoy, Joel A Tropp, and Teng Zhang. N−1 N Robust computation of linear models by convex relaxation. Foundations distributed uniformly on the unit sphere S in R . If N > of Computational Mathematics, 15(2):363–410, 2015. 2, then [20] Gilad Lerman, Teng Zhang, et al. Robust recovery of multiple subspaces by geometric lp minimization. The Annals of Statistics, 39(5):2686– n T 2 n 4 2N n 2N 2715, 2011. sup (u gi) + max log , 4 log u ≤ N 3 δ N δ [21] Guoying Li and Zhonglian Chen. Projection-pursuit approach to robust k k=1 i=1 r ! dispersion matrices and principal components: primary theory and monte Xn carlo. Journal of the American Statistical Association, 80(391), 1985. T 2 n 4 2N n 2N [22] Liyuan Li, Weimin Huang, Irene Yu-Hua Gu, and Qi Tian. Statistical inf (u gi) max log , 4 log kuk=1 ≥ N − 3 δ N δ modeling of complex backgrounds for foreground object detection. IEEE i=1 r ! Transactions on Image Processing, 13(11):1459–1472, 2004. X [23] Panos P Markopoulos, George N Karystinos, and Dimitris A Pados. Op- with probability at least 1 δ. timal algorithms for l1-subspace signal processing. IEEE Transactions − Lemma 10. [2] Suppose orthonormal matrix F RN×r on Signal Processing, 62(19):5046–5058, 2014. ∈ [24] Michael McCoy and Joel A Tropp. Two proposals for robust PCA using spans a random r-dimensional subspace. For a given vector semidefinite programming. Electronic Journal of Statistics, 5:1123– c RN×1 1160, 2011. ∈ [25] Amanda F Mejia, Mary Beth Nebel, Ani Eloyan, Brian Caffo, and T c1r¯ −3 Martin A Lindquist. Pca leverage: outlier detection for high-dimensional P c F 2 > 1 c2N log N, functional magnetic resonance imaging data. Biostatistics, 18(3), 2017. "k k N # ≤ − [26] Tormod Næs. Leverage and influence measures for principal component r regression. Chemometrics and Intelligent Laboratory Systems, 5(2), where c1 and c2 are constant real numbers and r¯ = 1989. . [27] Dohyung Park, Constantine Caramanis, and Sujay Sanghavi. Greedy max(r, log N) subspace clustering. In Advances in Neural Information Processing Lemma 11. [28] Suppose g g are i.i.d. random vectors Systems (NIPS), pages 2753–2761, Montreal, Canada, 2014. 1, ..., n N−1 N [28] Mostafa Rahmani and George Atia. Coherence pursuit: Fast, simple, distributed uniformly on the unit sphere S in R . If N > and robust subspace recovery. In International Conference on Machine 2, then Learning (ICML), pages 2864–2873, Long Beach, 2017. [29] Mostafa Rahmani and George K Atia. Coherence pursuit: Fast, simple, n 1 and robust principal component analysis. IEEE Transactions on Signal T n 2n log δ sup u gi < +2√n + Processing, 65(23):6260–6275, 2017. u | | √ s N 1 k k=1 i=1 N [30] Mostafa Rahmani and George K Atia. Innovation pursuit: A new ap- X − proach to subspace clustering. IEEE Transactions on Signal Processing, with probability at least 1 δ. 65(23):6276–6291, 2017. − [31] Mostafa Rahmani and George K Atia. 
Subspace clustering via optimal direction search. IEEE Signal Processing Letters, 24(12):1793–1797, 2017. A. Proof of Lemma 1 [32] Mostafa Rahmani and Ping Li. Outlier detection and robust pca using The optimal point of (3) is equivalent to the optimal point a convex measure of innovation. In Advances in Neural Information Processing Systems (NeurIPS), pages 14200–14210, Vancouver, Canada, of 2019. min cT DDT c subject to cT d =1 [33] Martin Slawski, Mostafa Rahmani, and Ping Li. A sparse representation- c i based approach to with partially shuffled labels. In Conference on Uncertainty in Artificial Intelligence (UAI), Tel Aviv, whose Lagrangian function is as follows Israel, 2019. T T T [34] Mahdi Soltanolkotabi and Emmanuel J Candes. A geometric analysis c DD c + γ(c di 1) , of subspace clustering with outliers. The Annals of Statistics, pages − 2195–2238, 2012. where γ is the Lagrangian multiplier. Assuming that D is a [35] Manolis C Tsakiris and Ren´eVidal. Dual principal component pursuit. In IEEE International Conference on Computer Vision Workshops, pages full rank matrix, the Lagrangian function can be used to find 10–18, Santiago, Chile, 2015. the optimal point as [36] Huan Xu, Constantine Caramanis, and Shie Mannor. Principal com- ponent analysis with contaminated data: The high dimensional case. (DDT )−1 d c∗ i (8) arXiv:1002.4658, 2010. i = T T −1 . [37] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via di (DD ) di outlier pursuit. In Advances in Neural Information Processing Systems Accordingly, (NIPS), pages 2496–2504, Vancouver, Canada, 2010. [38] Chong You, Daniel P Robinson, and Ren´e Vidal. Provable self- dT (DDT )−1DDT (DDT )−1 d representation based outlier detection in a union of subspaces. In IEEE DT c∗ 2 = i i = i 2 T T −1 2 Conference on Computer Vision and Pattern Recognition (CVPR), pages k k (di (DD ) di) 4323–4332, Honolulu, HI, 2017. 1 [39] Teng Zhang. Robust subspace recovery by Tyler’s m-estimator. Infor- T T −1 . mation and Inference: A Journal of the IMA, 5(1):1–21, 2016. di (DD ) di [40] Teng Zhang and Gilad Lerman. A novel m-estimator for robust PCA. ′ The Journal of Machine Learning Research, 15(1):749–808, 2014. Each data point di can be written as U Σvi. Thus, 1 1 DT c∗ 2 i 2 = T T −1 = T . k k di (DD ) di vi vi 11

B. Proof of Lemma 2

According to (8),
$$\|D^T c_i^*\|_2^2 = \left\| D^T \frac{(D D^T)^{-1} d_i}{d_i^T (D D^T)^{-1} d_i} \right\|_2^2 = \sum_{\substack{j=1 \\ j \neq i}}^{M_2} \frac{\left( d_i^T (D D^T)^{-1} d_j \right)^2}{\left( d_i^T (D D^T)^{-1} d_i \right)^2} \,.$$
Each data point $d_i$ can be written as $U' \Sigma v_i$. Therefore,
$$\|D^T c_i^*\|_2^2 = \sum_{\substack{j=1 \\ j \neq i}}^{M_2} \left( \frac{v_i^T v_j}{v_i^T v_i} \right)^2 .$$

C. Proof of Theorem 3

Define $v$ as the column of $V \in \mathbb{R}^{r_d \times M_2}$ (the matrix of right singular vectors) corresponding to the data point $d$. The Normalized Coherence Value of $d$ can be written as $c^{*T} H c^*$, where $H = D D^T$ and $c^*$ is the optimal point of
$$\min_{c} \ c^T H c \quad \text{subject to} \quad c^T d = 1 \,. \qquad (9)$$
In order to guarantee exact recovery of $\mathcal{U}$, it is enough to show that (5) holds. Accordingly, we establish a lower-bound for the Normalized Coherence Values corresponding to the inliers and an upper-bound for the Normalized Coherence Values corresponding to the outliers, and we derive sufficient conditions which guarantee that the lower-bound is larger than the upper-bound.

Suppose $d$ is an inlier. Then the linear constraint of (9) ensures that $\|c^*\|_2 \ge 1$ and $\|U^T c^*\|_2 \ge 1$. Therefore,
$$c^{*T} H c^* = \|A^T c^*\|_2^2 + \|B^T c^*\|_2^2 \ \ge\ \|U^T c^*\|_2^2 \inf_{\substack{\delta \in \mathcal{U} \\ \|\delta\|_2 = 1}} \|\delta^T A\|_2^2 + \|c^*\|_2^2 \inf_{\|\delta\|_2 = 1} \|\delta^T B\|_2^2 \ \ge\ \inf_{\substack{\delta \in \mathcal{U} \\ \|\delta\|_2 = 1}} \|\delta^T A\|_2^2 + \inf_{\|\delta\|_2 = 1} \|\delta^T B\|_2^2 \,. \qquad (10)$$
We can use Lemma 9 to establish a lower-bound for the Normalized Coherence Values corresponding to the inliers as follows:
$$\mathbb{P}\left[ \|D^T c^*\|_2^2 < \frac{n_i}{r} - \max\!\left( \frac{4}{3}\log\frac{2r}{\delta},\ 4\sqrt{\frac{n_i}{r}\log\frac{2r}{\delta}} \right) + \frac{n_o}{M_1} - \max\!\left( \frac{4}{3}\log\frac{2M_1}{\delta},\ 4\sqrt{\frac{n_o}{M_1}\log\frac{2M_1}{\delta}} \right) \right] < 2\delta \,. \qquad (11)$$
Now we need to establish an upper-bound for the Normalized Coherence Values corresponding to the outliers. Define $d^{\perp} = \frac{R R^T d}{\|d^T R\|_2^2}$, where $R$ is an orthonormal basis for $\mathcal{U}^{\perp}$ (which was defined as the complement of $\mathcal{U}$). By definition,
$$d^T d^{\perp} = \frac{d^T R R^T d}{\|d^T R\|_2^2} = 1 \,.$$
Since $c^*$ is the optimal point of (9) and $d^{\perp}$ satisfies the linear constraint,
$$\|D^T c^*\|_2^2 \ \le\ \|D^T d^{\perp}\|_2^2 = \|B^T d^{\perp}\|_2^2 \,.$$
In addition, according to the definition of $d^{\perp}$,
$$\|d^{\perp}\|_2^2 = \frac{1}{\|d^T R\|_2^2} \,.$$
Thus, we can conclude that
$$\mathbb{P}\left[ \|D^T c^*\|_2^2 > \frac{n_o}{\|d^T R\|_2^2 \, M_1} + \frac{1}{\|d^T R\|_2^2} \max\!\left( \frac{4}{3}\log\frac{2M_1}{\delta},\ 4\sqrt{\frac{n_o}{M_1}\log\frac{2M_1}{\delta}} \right) \right] < \delta \,. \qquad (12)$$
Therefore, according to (11) and (12), if (6) is satisfied, then (5) holds and $\mathcal{U}$ is recovered exactly with probability at least $1 - 3\delta$.

D. Proof of Theorem 4

The procedure used to prove Theorem 4 is similar to the procedure used to prove Theorem 3, i.e., we guarantee that a lower-bound for the Normalized Coherence Values corresponding to the inliers is larger than an upper-bound for the Normalized Coherence Values corresponding to the outliers. First we establish the lower-bound for the Normalized Coherence Values corresponding to the inliers. A Normalized Coherence Value can be written as
$$c^{*T} H c^* = \|A^T c^*\|_2^2 + \|B^T c^*\|_2^2 \,,$$
where $c^*$ was defined as the optimal point of (9). When $d$ is an inlier, $\|U^T c^*\|_2 \ge 1$, which means
$$\|D^T c^*\|_2^2 \ \ge\ \|A^T c^*\|_2^2 \ \ge\ \|U^T c^*\|_2^2 \inf_{\substack{\delta \in \mathcal{U} \\ \|\delta\|_2 = 1}} \|\delta^T A\|_2^2 \ \ge\ \inf_{\substack{\delta \in \mathcal{U} \\ \|\delta\|_2 = 1}} \|\delta^T A\|_2^2 \,. \qquad (13)$$
Since the inliers are randomly distributed on $\mathcal{U} \cap \mathbb{S}^{M_1 - 1}$, according to Lemma 9,
$$\mathbb{P}\left[ \|D^T c^*\|_2^2 < \frac{n_i}{r} - \max\!\left( \frac{4}{3}\log\frac{2r}{\delta},\ 4\sqrt{\frac{n_i}{r}\log\frac{2r}{\delta}} \right) \right] < \delta \,. \qquad (14)$$
In order to establish the upper-bound for the Normalized Coherence Values corresponding to the outliers, first define $d^{\perp} = \frac{R R^T d}{\|d^T R\|_2^2}$, where $R$ is an orthonormal basis for $\mathcal{U}^{\perp}$. The vector $d^{\perp}$ lies in $\mathcal{U}^{\perp}$ and $d^T d^{\perp} = 1$. Thus,
$$\|D^T c^*\|_2^2 \ \le\ \|D^T d^{\perp}\|_2^2 = \|B^T d^{\perp}\|_2^2 \,.$$
In addition, $\|U_o^T d^{\perp}\|_2 \le \|d^{\perp}\|_2 \, \|U_o^T U^{\perp}\|$. Therefore, we can use Lemma 9 to establish the following upper-bound:
$$\mathbb{P}\left[ \|D^T c^*\|_2^2 > \|U_o^T U^{\perp}\|^2 \, \frac{\psi \, n_o}{r_o} + \psi \max\!\left( \frac{4}{3}\log\frac{2r_o}{\delta},\ 4\sqrt{\frac{n_o}{r_o}\log\frac{2r_o}{\delta}} \right) \right] < \delta \,. \qquad (15)$$
According to (14) and (15), if the sufficient condition of Theorem 4 is satisfied, then (5) holds and $\mathcal{U}$ is recovered exactly with probability at least $1 - 2\delta$.
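As an illustrative companion to the proofs of Theorems 3 and 4 (synthetic sizes chosen only for illustration, not the paper's experiments), the following Python sketch simulates the randomly distributed inlier/outlier model, computes the Normalized Coherence Values in closed form via (8), and prints the minimum over the inliers and the maximum over the outliers; their gap is the kind of separation that condition (5) asks for.

import numpy as np

rng = np.random.default_rng(1)
M1, r, n_i, n_o = 30, 3, 200, 50             # illustrative sizes, not the paper's settings
U = np.linalg.qr(rng.standard_normal((M1, r)))[0]                        # basis of the inlier subspace
A = U @ rng.standard_normal((r, n_i)); A /= np.linalg.norm(A, axis=0)    # inliers in span(U)
B = rng.standard_normal((M1, n_o));    B /= np.linalg.norm(B, axis=0)    # unstructured outliers
D = np.hstack([B, A])                        # outliers first, inliers last, as in the proofs

H_inv = np.linalg.inv(D @ D.T)               # D is full rank here, so H = D D^T is invertible
# Normalized Coherence Value of column i, i.e., c_i*^T H c_i* with c_i* from (8),
# which reduces to 1 / (d_i^T (D D^T)^{-1} d_i) by the computation in the proof of Lemma 1.
ncv = 1.0 / np.einsum('ij,jk,ik->i', D.T, H_inv, D.T)

print("min over inliers :", ncv[n_o:].min())
print("max over outliers:", ncv[:n_o].max())  # the inliers sit well above the outliers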

E. Proof of Theorem 5

The Coherence Value corresponding to a data point $d$ is equal to
$$\sum_{i=1}^{M_2} (d^T d_i)^2 = \|d^T D\|_2^2 = \|d^T A\|_2^2 + \|d^T B\|_2^2 \,.$$
We use a procedure similar to the one used to prove Theorem 4. First we establish a lower-bound for the Coherence Values corresponding to the inliers. Suppose $d$ is an inlier. Similar to (13) and (14),
$$\mathbb{P}\left[ \|D^T d\|_2^2 < \frac{n_i}{r} - \max\!\left( \frac{4}{3}\log\frac{2r}{\delta},\ 4\sqrt{\frac{n_i}{r}\log\frac{2r}{\delta}} \right) \right] < \delta \,. \qquad (16)$$
Next, we need to establish an upper-bound for the Coherence Values corresponding to the outliers. Since $d \in \mathcal{U}_o$,
$$\|d^T D\|_2^2 \ \le\ \sup_{\substack{\delta \in \mathcal{U}_o \\ \|\delta\|_2 = 1}} \|\delta^T B\|_2^2 + \|U^T U_o\|^2 \sup_{\substack{\delta \in \mathcal{U} \\ \|\delta\|_2 = 1}} \|\delta^T A\|_2^2 \,.$$
Accordingly, using Lemma 9, we can conclude that when $d$ is an outlier,
$$\mathbb{P}\left[ \|D^T d\|_2^2 > \|U^T U_o\|^2 \left( \frac{n_i}{r} + \max\!\left( \frac{4}{3}\log\frac{2r}{\delta},\ 4\sqrt{\frac{n_i}{r}\log\frac{2r}{\delta}} \right) \right) + \frac{n_o}{r_o} + \max\!\left( \frac{4}{3}\log\frac{2r_o}{\delta},\ 4\sqrt{\frac{n_o}{r_o}\log\frac{2r_o}{\delta}} \right) \right] < 2\delta \,. \qquad (17)$$
According to (16) and (17), if the sufficient condition of Theorem 5 is satisfied, the minimum of the Coherence Values corresponding to the inliers is larger than the maximum of the Coherence Values corresponding to the outliers, and $\mathcal{U}$ is recovered exactly with probability at least $1 - 3\delta$.

F. Proof of Theorem 7

Similar to the proofs of the previous theorems, first we establish a lower-bound for the Normalized Coherence Values corresponding to the inliers and an upper-bound for the Normalized Coherence Values corresponding to the outliers. Subsequently, the final sufficient conditions are derived to guarantee that the lower-bound is higher than the upper-bound with high probability.

Using (13), we can bound the Normalized Coherence Values corresponding to the inliers as stated in (14).

Similar to the proof of Theorem 4, we define $d^{\perp} = \frac{R R^T d}{\|d^T R\|_2^2}$, where $R$ is an orthonormal basis for $\mathcal{U}^{\perp}$. Since $d^T d^{\perp} = 1$,
$$c^{*T} H c^* \ \le\ d^{\perp T} H d^{\perp} = \|B^T d^{\perp}\|_2^2 \,.$$
In addition,
$$\|B^T d^{\perp}\|_2^2 = \frac{1}{1 + \eta^2} \sum_{i=1}^{n_o} \left( q^T d^{\perp} + \eta \, v_i^T d^{\perp} \right)^2 \ \le\ \frac{1}{1 + \eta^2} \left( n_o \left( q^T d^{\perp} \right)^2 + \eta^2 \sum_{i=1}^{n_o} \left( v_i^T d^{\perp} \right)^2 + 2\eta \left| q^T d^{\perp} \right| \sum_{i=1}^{n_o} \left| v_i^T d^{\perp} \right| \right) .$$
Since $d^{\perp}$ lies in $\mathcal{U}^{\perp}$, $|q^T d^{\perp}| \le \|d^{\perp}\|_2 \, \|q^T U^{\perp}\|_2$. Moreover, according to Lemma 11,
$$\sum_{i=1}^{n_o} \left| v_i^T d^{\perp} \right| \ \le\ \|d^{\perp}\|_2 \left( \frac{n_o}{\sqrt{M_1}} + 2\sqrt{n_o} + \sqrt{\frac{2 n_o \log\frac{1}{\delta}}{M_1 - 1}} \right) \qquad (18)$$
with probability at least $1 - \delta$. Accordingly, using (18) and Lemma 9, we can establish the following bound:
$$\mathbb{P}\left[ (1 + \eta^2) \|B^T d^{\perp}\|_2^2 > \psi \, n_o \, \|q^T U^{\perp}\|_2^2 + \frac{\psi \eta^2 n_o}{M_1} + \eta^2 \psi \max\!\left( \frac{4}{3}\log\frac{2M_1}{\delta},\ 4\sqrt{\frac{n_o}{M_1}\log\frac{2M_1}{\delta}} \right) + 2\eta \, \psi \, \|q^T U^{\perp}\|_2 \left( \frac{n_o}{\sqrt{M_1}} + 2\sqrt{n_o} + \sqrt{\frac{2 n_o \log\frac{1}{\delta}}{M_1 - 1}} \right) \right] \le 2\delta \,.$$
Thus, if the sufficient condition of Theorem 7 is satisfied, then (5) holds and $\mathcal{U}$ is recovered exactly with probability at least $1 - 4\delta$.

G. Proof of Theorem 8

The only difference between this theorem and Theorem 3 is the presumed model for the distribution of the inliers. We cannot use Lemma 9 to establish a lower-bound for $\inf_{\delta \in \mathcal{U}, \|\delta\|_2 = 1} \|\delta^T A\|_2^2$. According to the clustering structure of the inliers,
$$\inf_{\substack{\delta \in \mathcal{U} \\ \|\delta\|_2 = 1}} \|A^T \delta\|_2^2 = \inf_{\substack{\delta \in \mathcal{U} \\ \|\delta\|_2 = 1}} \sum_{k=1}^{m} \|\delta^T U_k\|_2^2 \left\| A_k^T \frac{U_k U_k^T \delta}{\|\delta^T U_k\|_2} \right\|_2^2 \ \ge\ \inf_{\substack{\delta \in \mathcal{U} \\ \|\delta\|_2 = 1}} \sum_{k=1}^{m} \|\delta^T U_k\|_2^2 \left( \inf_{\substack{\delta_k \in \mathcal{U}_k \\ \|\delta_k\|_2 = 1}} \|\delta_k^T A_k\|_2^2 \right) . \qquad (19)$$
Define
$$\mathcal{B} = \min_{k} \left\{ \inf_{\substack{\delta_k \in \mathcal{U}_k \\ \|\delta_k\|_2 = 1}} \|\delta_k^T A_k\|_2^2 \right\} .$$
According to (19) and the definition of $\mathcal{B}$,
$$\inf_{\substack{\delta \in \mathcal{U} \\ \|\delta\|_2 = 1}} \|A^T \delta\|_2^2 \ \ge\ \mathcal{B} \inf_{\substack{\delta \in \mathcal{U} \\ \|\delta\|_2 = 1}} \sum_{k=1}^{m} \|\delta^T U_k\|_2^2 = \vartheta \, \mathcal{B} \,.$$
In addition, using Lemma 9 and the definition of $\mathcal{A}$, we can bound $\mathcal{B}$ as follows:
$$\mathbb{P}\left[ \mathcal{B} \le \mathcal{A} \right] < \delta \,. \qquad (20)$$
Therefore, according to (19) and (20), if the sufficient conditions of Theorem 8 are satisfied, then (5) is satisfied and $\mathcal{U}$ is recovered exactly with probability at least $1 - 3\delta$.
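The quantity $\mathcal{B}$ in the proof of Theorem 8 is the worst-case energy that the inliers of a single cluster leave along any direction of their own subspace. As an illustrative check (synthetic clusters with parameters chosen only for illustration, not taken from the paper), the Python sketch below computes $\mathcal{B}$ as the smallest nonzero squared singular value of each cluster's inlier matrix and compares it with $n_k/r_k$, the value around which a Lemma 9 type concentration bound would place it.

import numpy as np

rng = np.random.default_rng(2)
M1, m, r_k, n_k = 30, 4, 2, 60               # m clusters of dimension r_k with n_k inliers each
B_val = np.inf
for _ in range(m):
    U_k = np.linalg.qr(rng.standard_normal((M1, r_k)))[0]        # basis of the cluster subspace U_k
    A_k = U_k @ rng.standard_normal((r_k, n_k))
    A_k /= np.linalg.norm(A_k, axis=0)                           # cluster inliers on the unit sphere
    # inf over unit delta_k in U_k of ||delta_k^T A_k||_2^2 equals the r_k-th largest
    # squared singular value of A_k, i.e., its smallest nonzero one.
    s = np.linalg.svd(A_k, compute_uv=False)
    B_val = min(B_val, s[r_k - 1] ** 2)

print("B =", B_val, "  n_k / r_k =", n_k / r_k)   # B stays close to n_k / r_k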

H. Proof of Theorem 6

We guarantee that a lower-bound for the Normalized Coherence Values corresponding to the inliers is larger than an upper-bound for the Normalized Coherence Values corresponding to the outliers. First we establish the lower-bound for the Normalized Coherence Values corresponding to the inliers. Note that
$$c^{*T} H c^* = \frac{1}{1 + \sigma_n^2} \|(A + E)^T c^*\|_2^2 + \|B^T c^*\|_2^2 \,,$$
where $c^*$ was defined in (9). In addition, the optimal point of (9) when $d = d_i$ is equal to $c_i^* = \frac{(D D^T)^{-1} d_i}{d_i^T (D D^T)^{-1} d_i}$. According to the definition of the vectors $\{t_i\}$, $\|c_i^*\|_2 = \frac{\|\Sigma^{-2} t_i\|_2}{t_i^T \Sigma^{-2} t_i}$. When $d_i$ is an inlier, $\frac{1}{\sqrt{1 + \sigma_n^2}} \left( (a_i + e_i)^T c^* \right) = 1$. Thus,
$$\sqrt{1 + \sigma_n^2} + t_{\max} \sigma_n \ \ge\ \|U^T c_i^*\|_2 \ \ge\ \sqrt{1 + \sigma_n^2} - t_{\max} \sigma_n \,,$$
where $t_{\max}$ and $t_{\min}$ were defined in Section IV-C. In addition, when $d_i$ is an inlier, $t_{\min} \le \|c_i^*\|_2 \le t_{\max}$ and
$$\|(A + E)^T c^*\|_2^2 \ \ge\ \|A^T c^*\|_2^2 + \|E^T c^*\|_2^2 - 2 \sum_{i=n_o+1}^{M_2} (a_i^T c^*)(e_i^T c^*) \ \ge\ \|A^T c^*\|_2^2 + \|E^T c^*\|_2^2 - 2 \sigma_n n_i t_{\max}^2 \,.$$
Therefore, we can bound the Normalized Coherence Value corresponding to an inlier as follows:
$$\|D^T c^*\|_2^2 \ \ge\ \frac{\left( \sqrt{1 + \sigma_n^2} - t_{\max} \sigma_n \right)^2}{1 + \sigma_n^2} \inf_{\substack{\delta \in \mathcal{U} \\ \|\delta\|_2 = 1}} \|\delta^T A\|_2^2 + \frac{\sigma_n^2 t_{\min}^2}{1 + \sigma_n^2} \inf_{\|\delta\|_2 = 1} \|\delta^T N\|_2^2 + t_{\min}^2 \inf_{\|\delta\|_2 = 1} \|\delta^T B\|_2^2 - 2 \sigma_n n_i t_{\max}^2 \,. \qquad (21)$$
Using (21) and Lemma 9, we can establish the following bound:
$$\|D^T c^*\|_2^2 \ \ge\ \frac{\left( \sqrt{1 + \sigma_n^2} - t_{\max} \sigma_n \right)^2}{1 + \sigma_n^2} \left( \frac{n_i}{r} - \max\!\left( \frac{4}{3}\log\frac{2r}{\delta},\ 4\sqrt{\frac{n_i}{r}\log\frac{2r}{\delta}} \right) \right) + \left( \frac{\sigma_n^2 t_{\min}^2}{1 + \sigma_n^2} + t_{\min}^2 \right) \frac{n_i + n_o}{M_1} - 2 \max\!\left( \frac{4}{3}\log\frac{2M_1}{\delta},\ 4\sqrt{\frac{\max(n_i, n_o)}{M_1}\log\frac{2M_1}{\delta}} \right) - 2 \sigma_n n_i t_{\max}^2$$
with probability at least $1 - 3\delta$.

Next, suppose that $d$ is an outlier. Similar to the proof of Theorem 3, define $d^{\perp} = \frac{R R^T d}{\|d^T R\|_2^2}$, where $R$ is an orthonormal basis for $\mathcal{U}^{\perp}$. Since $d^T d^{\perp} = 1$,
$$c^{*T} H c^* \ \le\ \|D^T d^{\perp}\|_2^2 = \|B^T d^{\perp}\|_2^2 + \frac{1}{1 + \sigma_n^2} \|E^T d^{\perp}\|_2^2$$
$$\le\ \frac{1}{\|d^T R\|_2^2} \left( \frac{n_o}{M_1} + \max\!\left( \frac{4}{3}\log\frac{2M_1}{\delta},\ 4\sqrt{\frac{n_o}{M_1}\log\frac{2M_1}{\delta}} \right) \right) + \frac{\sigma_n^2}{(1 + \sigma_n^2) \, \|d^T R\|_2^2} \left( \frac{n_i}{M_1} + \max\!\left( \frac{4}{3}\log\frac{2M_1}{\delta},\ 4\sqrt{\frac{n_i}{M_1}\log\frac{2M_1}{\delta}} \right) \right)$$
with probability at least $1 - 2\delta$. According to the established lower-bound and upper-bound, if the sufficient conditions of Theorem 6 are satisfied, the Normalized Coherence Values satisfy (5) with probability at least $1 - 7\delta$.
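As an illustrative companion to the noisy model of Theorem 6 (again with synthetic parameters of our own choosing, not the paper's experiments), the sketch below perturbs the inliers with Gaussian noise of level sigma_n, re-normalizes the columns, and recomputes the Normalized Coherence Values; for moderate noise the inlier/outlier separation used in the proof is still visible.

import numpy as np

rng = np.random.default_rng(3)
M1, r, n_i, n_o, sigma_n = 30, 3, 200, 50, 0.1     # illustrative sizes and noise level
U = np.linalg.qr(rng.standard_normal((M1, r)))[0]
A = U @ rng.standard_normal((r, n_i)); A /= np.linalg.norm(A, axis=0)
E = sigma_n * rng.standard_normal((M1, n_i)) / np.sqrt(M1)        # additive noise on the inliers
B = rng.standard_normal((M1, n_o));    B /= np.linalg.norm(B, axis=0)
D = np.hstack([B, A + E]); D /= np.linalg.norm(D, axis=0)          # noisy, re-normalized data

H_inv = np.linalg.inv(D @ D.T)
ncv = 1.0 / np.einsum('ij,jk,ik->i', D.T, H_inv, D.T)              # Normalized Coherence Values

print("min over noisy inliers:", ncv[n_o:].min())
print("max over outliers     :", ncv[:n_o].max())                  # the separation survives the noise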
